Quality Datasets
Datasets 101
Below, we outline basic checks that we encourage all members of the community to follow when curating or working with drug discovery datasets. This is the first of many resources that we aim to release to help the community develop methods that matter.
The dataset is representative of applications in real-world drug discovery
Creators of the dataset must be able to explain the data generation process and describe the specific applications of this dataset in drug discovery.
Take, for example, the FreeSolv dataset in MoleculeNet, discussed on Pat Walters' blog. Although the dataset was designed to evaluate molecular dynamics methods, it has turned into a generic property prediction task for the free energy of solvation. Used in isolation, however, this quantity is not particularly useful in drug discovery.
The dataset stems from a consistent, original source
Creators of the dataset must cite the original source of the data. If data is aggregated from multiple sources or preprocessed in some way, the process needs to be transparent and the rationale should be well documented. Blindly combining datasets can introduce significant noise.
Datasets that violate this rule include tdcommons/solubility-aqsoldb and tdcommons/bbb-martins. In both cases, data has been collected from multiple sources, yet there are no references to the primary literature.
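One way to keep aggregation transparent is to carry the reference along with every value instead of discarding it at merge time. The following is a minimal sketch of that idea, not a prescribed workflow: the function names, the record layout, the citations, and the tolerance are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def aggregate_with_provenance(sources):
    """Merge records from several sources, keeping the reference for every value.

    `sources` maps a citation string to a list of (molecule_id, value) pairs.
    This layout is an assumption made for the sketch.
    """
    merged = defaultdict(list)
    for reference, records in sources.items():
        # Refuse data that cannot be traced back to a reference.
        if not reference.strip():
            raise ValueError("every source needs a reference to primary literature")
        for molecule_id, value in records:
            merged[molecule_id].append({"value": value, "source": reference})
    return dict(merged)


def cross_source_disagreements(merged, tolerance):
    """Return molecules whose values from different sources differ by more than `tolerance`."""
    flagged = {}
    for molecule_id, entries in merged.items():
        values = [entry["value"] for entry in entries]
        distinct_sources = {entry["source"] for entry in entries}
        if len(distinct_sources) > 1 and max(values) - min(values) > tolerance:
            flagged[molecule_id] = entries
    return flagged


# Hypothetical data: identifiers, values, and citations are placeholders.
sources = {
    "Hypothetical Paper A (2010)": [("mol-1", -2.10), ("mol-2", -4.50)],
    "Hypothetical Paper B (2015)": [("mol-1", -3.40), ("mol-3", -1.20)],
}
merged = aggregate_with_provenance(sources)
flagged = cross_source_disagreements(merged, tolerance=0.5)
print(sorted(flagged))
```

Because each value keeps its source, a disagreement between sources can be traced and resolved deliberately rather than averaged away silently.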
The dataset does not contain obvious errors or ambiguous data
Creators of the dataset should check for obvious duplicates, invalid data, or ambiguous data. They should also visualize the data distributions to highlight potential outliers.
For example, tdcommons/bbb-martins violates this rule as it contains many duplicate structures.
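The checks described above can be sketched with the Python standard library alone. This is an illustrative sketch under stated assumptions, not a complete curation pipeline: the `audit` function and the example records are hypothetical, structure keys are assumed to be already canonical (plain string comparison will miss duplicates written as different strings for the same molecule, so a cheminformatics toolkit would be needed for real data), and a simple interquartile-range rule stands in for visually inspecting the distribution.

```python
import math
import statistics
from collections import defaultdict


def audit(records):
    """Flag invalid rows, duplicates, conflicting duplicates, and outliers.

    `records` is a list of (structure_key, value) pairs, where structure_key
    is assumed to be a canonical identifier for the molecule.
    """
    report = {"invalid": [], "duplicates": {}, "ambiguous": {}, "outliers": []}
    by_key = defaultdict(list)
    valid = []

    for index, (key, value) in enumerate(records):
        is_number = isinstance(value, (int, float)) and math.isfinite(value)
        if not key or not is_number:
            report["invalid"].append(index)  # missing key, missing or non-finite value
            continue
        by_key[key].append(value)
        valid.append((key, value))

    for key, values in by_key.items():
        if len(values) > 1:
            report["duplicates"][key] = values
            if len(set(values)) > 1:
                # Same structure, different measurements: ambiguous data.
                report["ambiguous"][key] = values

    if len(valid) >= 2:
        # Tukey's rule: values beyond 1.5 * IQR from the quartiles are flagged.
        q1, _, q3 = statistics.quantiles([v for _, v in valid], n=4)
        spread = 1.5 * (q3 - q1)
        report["outliers"] = [
            (key, value) for key, value in valid
            if value < q1 - spread or value > q3 + spread
        ]
    return report


# Hypothetical records, for illustration only.
records = [
    ("CCO", 0.1), ("CCO", 0.1),             # exact duplicate
    ("c1ccccc1", -1.0), ("c1ccccc1", 1.0),  # duplicate with conflicting values
    ("", 0.5), ("CCN", float("nan")),       # invalid rows
    ("CCC", 0.2), ("CCCC", 0.3), ("CCCl", 0.0), ("CCBr", 25.0),
]
report = audit(records)
print(report["invalid"])
print(sorted(report["duplicates"]))
print(sorted(report["ambiguous"]))
print(report["outliers"])
```

Flagged rows are candidates for review, not automatic removal: an outlier may be a genuine measurement, and a conflicting duplicate may point to differences in assay conditions worth documenting.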
Introducing Certified Datasets
Build with Confidence
Explore our first certified datasets. You can find notebooks outlining curation details in Polaris Recipes.
Guidelines
Data curation can be very nuanced depending on the specific modality you’re working with. That’s why we’re building steering committees composed of industry experts, starting with small molecules, to provide detailed guidelines on dataset curation, method evaluation, and method comparison.
In the Backlog
Interested in helping foster the development of more impactful ML methods in your domain? Contact us about joining a steering committee for another modality in drug discovery.