
This dataset has not yet been certified by approved reviewers. It may contain issues related to data completeness and quality.

Dataset

graphium/zinc12k-v1

A 12K subset of the ZINC molecular graphs (250K) dataset.

Created on: July 17, 2024
Dataset size: 620 KB
Number of datapoints: 12,000
Public

Tags

Graph

Modalities

MOLECULE

Details

README

Background

ZINC is a free database of commercially available compounds for virtual screening. It contains over 230 million purchasable compounds in ready-to-dock 3D formats, and over 750 million purchasable compounds that can be searched for analogs. ZINC12K is a 12,000-sample subset of the ZINC molecular graphs.

Assay information

ZINC was loaded from 134 commercial supplier catalogs and 36 annotated catalogs (Table 1). For salts, only the largest organic component is kept. Molecules containing any atom other than H, C, N, O, F, S, P, Si, Cl, Br, or I are removed, a limitation imposed by the use of the Merck Molecular Force Field (MMFF94). Only molecules passing the primary filtering rules are loaded. The filtering rules are implemented in OpenEye’s OEChem (19) and are listed in text and graphical form (20) at filtering.docking.org. Molecules are prepared, and their physical properties calculated, according to the protocol detailed in the ZINC paper.
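
The salt handling and element filter described above can be sketched in a few lines of Python. This is only an illustration, not the ZINC pipeline itself, which uses OpenEye's OEChem. The function names `largest_organic_component` and `passes_element_filter` are hypothetical, as is the representation of a molecule as a list of fragments, each given as a list of element symbols. "Organic" is approximated here as "contains carbon", which is an assumption.

```python
# Elements permitted by the filter described above (MMFF94 limitation).
ALLOWED_ELEMENTS = frozenset(
    {"H", "C", "N", "O", "F", "S", "P", "Si", "Cl", "Br", "I"}
)


def largest_organic_component(fragments):
    """Return the largest carbon-containing fragment, or None if there is none.

    Each fragment is a list of element symbols (hypothetical representation).
    "Organic" is approximated as "contains at least one carbon atom".
    """
    organic = [frag for frag in fragments if "C" in frag]
    if not organic:
        return None
    return max(organic, key=len)


def passes_element_filter(fragment):
    """Return True if every atom in the fragment is an allowed element."""
    return all(symbol in ALLOWED_ELEMENTS for symbol in fragment)


# Sodium acetate as two fragments: the acetate anion and the sodium cation.
sodium_acetate = [["C", "C", "O", "O", "H", "H", "H"], ["Na"]]
parent = largest_organic_component(sodium_acetate)
keep = parent is not None and passes_element_filter(parent)
```

In this example the sodium counterion is discarded before the element check, so the acetate fragment passes even though Na is not an allowed element.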

Description of readout:

  • SA: Synthetic accessibility score.
  • LogP: Log P, octanol-water partition coefficient.
  • Score: Constrained solubility, defined as logP − SA − cycle: the octanol-water partition coefficient (logP) penalized by the synthetic accessibility score (SA) and the number of long cycles (cycle).
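
The Score readout is a direct combination of the other terms. A minimal sketch, assuming logP, SA, and the long-cycle count have already been computed for a molecule. The function name `constrained_solubility` and the example values are hypothetical and are not taken from the dataset.

```python
def constrained_solubility(logp: float, sa: float, cycle: int) -> float:
    """Score = logP - SA - cycle, following the readout description."""
    return logp - sa - cycle


# Hypothetical values for a single molecule.
score = constrained_solubility(logp=2.5, sa=3.0, cycle=1)
```

A higher logP raises the score, while a harder synthesis (larger SA) or more long cycles lowers it.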

Data resource

References:

User Attributes

These are custom, user-defined attributes that are not required by the Polaris data model.

Attribute    Value
year         2022