Background
PM6_83M dataset is similar to the PCQM4M and comes from the same PubChemQC project. However, it uses the PM6 semi-empirical computation of the quantum properties, which is orders of magnitude faster than DFT computation at the expense of less accuracy.
This dataset covers 83M unique molecules, 62 graph-level tasks, and 7 node-level tasks. To our knowledge, this is the largest dataset available for training 2D-GNNs regarding the number of unique molecules. The various tasks come from four different molecular states, namely S0
for the ground state, T0
for the lowest energy triplet excited state, cation
for the positively charged state, and anion
for the negatively charged state. In total, there are 221M PM6 computations.
Here , pm6-subset-v1
is a subset of PM6_83M
dataset which includes ~8.5M molecules and their quantum properties.