
Dataset

recursion/rxrx-compound-gene-activity

Mapping the mechanisms by which drugs exert their actions is an important challenge in advancing the use of high-dimensional biological data like phenomics. This dataset holds associations between various small molecules and genes. It is designed to accompany the RxRx3-Core dataset and the OpenPhenom-S/16 model.

Created on: December 12, 2024
Dataset size: 53 KB
Number of datapoints: 18,681
Public

Status

Uncertified

This artifact has not been certified by approved reviewers. It may contain issues related to data quality.


Tags

gene
compound

Modalities

No modalities found.

Details

README


Compound Gene Activity

This dataset started from 2,921 associations curated from public sources such as ChEMBL, BindingDB, and US patents. For associations with multiple IC50 or EC50 values, we took the median.
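The median step can be illustrated with a minimal sketch. The record layout, the identifiers (`cmpd_A`, `GENE1`) and the function name `aggregate_by_median` are hypothetical and chosen only for illustration; the actual curation pipeline is not described in this README.

```python
from collections import defaultdict
from statistics import median

# Hypothetical records: (compound_id, gene_symbol, activity in nM).
# These values are made up for illustration and are not from the dataset.
measurements = [
    ("cmpd_A", "GENE1", 50.0),
    ("cmpd_A", "GENE1", 200.0),
    ("cmpd_A", "GENE1", 80.0),
    ("cmpd_B", "GENE2", 12000.0),
]


def aggregate_by_median(records):
    """Collapse repeated (compound, gene) measurements to their median."""
    grouped = defaultdict(list)
    for compound, gene, activity_nm in records:
        grouped[(compound, gene)].append(activity_nm)
    return {pair: median(values) for pair, values in grouped.items()}


aggregated = aggregate_by_median(measurements)
for (compound, gene), activity_nm in aggregated.items():
    print(compound, gene, activity_nm)
```

A pair with a single measurement keeps that value unchanged, since the median of one number is the number itself.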

We turned this into a binary classification task as follows:

  • A positive pair is any relationship that has an activity value < 1000 nM in the original dataset.
  • A negative pair is defined in one of two ways:
    • It is either an annotated relationship that has an activity value > 10,000 nM in the original dataset.
    • Or it is a randomly sampled gene for which a compound has no annotated association in the original dataset with an activity value < 10,000 nM. Since the genes in this dataset are well characterized, we consider it unlikely that many of these compounds act against other genes in the set.
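The labeling rules above could be sketched roughly as follows. This is an assumption-laden illustration, not the authors' code: the function names, the dictionary layout, and the example identifiers are hypothetical, and the README does not say how values of exactly 1,000 nM or 10,000 nM are handled, so the sketch treats them as unlabeled.

```python
import random

POSITIVE_BELOW_NM = 1_000   # positive if activity < 1,000 nM
NEGATIVE_ABOVE_NM = 10_000  # negative if activity > 10,000 nM


def label_annotated(activity_nm):
    """Return 1 (positive), 0 (negative), or None (no label).

    Boundary values (exactly 1,000 or 10,000 nM) return None here;
    that choice is an assumption, since the README leaves it open.
    """
    if activity_nm < POSITIVE_BELOW_NM:
        return 1
    if activity_nm > NEGATIVE_ABOVE_NM:
        return 0
    return None


def sample_negative_gene(compound, annotated, all_genes, rng):
    """Pick a random gene for which `compound` has no annotated
    activity below 10,000 nM; return None if no such gene exists."""
    excluded = {
        gene
        for (c, gene), activity_nm in annotated.items()
        if c == compound and activity_nm < NEGATIVE_ABOVE_NM
    }
    candidates = sorted(set(all_genes) - excluded)
    return rng.choice(candidates) if candidates else None


# Hypothetical annotations: (compound, gene) -> median activity in nM.
annotated = {
    ("cmpd_A", "GENE1"): 80.0,      # labeled positive
    ("cmpd_A", "GENE2"): 5_000.0,   # between thresholds, no label
    ("cmpd_A", "GENE3"): 20_000.0,  # labeled negative
}
genes = ["GENE1", "GENE2", "GENE3", "GENE4"]

labels = {pair: label_annotated(value) for pair, value in annotated.items()}
sampled = sample_negative_gene("cmpd_A", annotated, genes, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible. A fuller version would probably also skip genes that already appear as annotated negatives for the same compound, to avoid duplicate pairs.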

In total, this leads to 17,140 classified relationships out of 18,681 relationships total. The remaining 18,681 - 17,140 = 1,541 associations are "indecisive": they were in the original 2,921 but had an activity > 1,000 nM and < 10,000 nM. We have left these in the dataset for completeness and reproducibility.