
The QCML Dataset, Quantum Chemistry Reference Data from 33.5M DFT and 14.7B Semi-empirical Calculations

Overview
Journal Sci Data
Specialty Science
Date 2025 Mar 8
PMID 40057556
Abstract

Machine learning (ML) methods enable prediction of the properties of chemical structures without computationally expensive ab initio calculations. The quality of such predictions depends on the reference data used to train the model. In this work, we introduce the QCML dataset, a comprehensive dataset for training ML models for quantum chemistry. The QCML dataset systematically covers chemical space with small molecules of up to 8 heavy atoms and includes elements from a large fraction of the periodic table, as well as different electronic states. Starting from chemical graphs, conformer search and normal mode sampling are used to generate both equilibrium and off-equilibrium 3D structures, for which various properties are calculated with semi-empirical methods (14.7 billion entries) and density functional theory (33.5 million entries). The covered properties include energies, forces, multipole moments, and other quantities such as Kohn-Sham matrices. We provide a first demonstration of the utility of the dataset by training ML-based force fields on the data and applying them to run molecular dynamics simulations.
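
The abstract does not specify how normal mode sampling is carried out, so the following is only a minimal sketch of the general idea, not the authors' procedure. It assumes the normal modes and eigenvalues of the mass-weighted Hessian are already known, and it draws each mode amplitude from a classical harmonic Boltzmann distribution. The function name `normal_mode_sample`, the toy one-dimensional diatomic, and all numerical values are hypothetical and chosen for illustration.

```python
import math
import random


def normal_mode_sample(eq_coords, masses, modes, eigenvalues, kT, rng):
    """Return one off-equilibrium structure displaced along normal modes.

    eq_coords:   flat list of equilibrium Cartesian coordinates (length n)
    masses:      flat list of per-coordinate masses (length n)
    modes:       unit-norm eigenvectors of the mass-weighted Hessian
    eigenvalues: matching positive eigenvalues (squared angular frequencies)
    kT:          thermal energy, in units consistent with the eigenvalues
    rng:         a random.Random instance, passed in for reproducibility
    """
    coords = list(eq_coords)
    for mode, eigenvalue in zip(modes, eigenvalues):
        # Assumption: classical harmonic sampling, amplitude ~ N(0, kT / eigenvalue).
        amplitude = rng.gauss(0.0, math.sqrt(kT / eigenvalue))
        for i, component in enumerate(mode):
            # Convert the mass-weighted displacement back to Cartesian space.
            coords[i] += amplitude * component / math.sqrt(masses[i])
    return coords


# Hypothetical toy system: two equal masses on a line joined by a spring.
# Its mass-weighted Hessian is (k/m) * [[1, -1], [-1, 1]], whose only
# vibrational mode is the antisymmetric stretch with eigenvalue 2k/m.
eq_coords = [0.0, 1.0]
masses = [1.0, 1.0]
spring_k = 1.0
stretch = [-1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0)]
modes = [stretch]
eigenvalues = [2.0 * spring_k / masses[0]]

rng = random.Random(0)
samples = [
    normal_mode_sample(eq_coords, masses, modes, eigenvalues, kT=0.01, rng=rng)
    for _ in range(5)
]
for sample in samples:
    print(sample)
```

In this sketch the translational mode is excluded, so every sample keeps the centre of mass fixed and changes only the bond length. A real pipeline would obtain the Hessian from a quantum chemistry calculation and might use a different amplitude distribution.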

Citing Articles

The QCML dataset, Quantum chemistry reference data from 33.5M DFT and 14.7B semi-empirical calculations.

Ganscha S, Unke O, Ahlin D, Maennel H, Kashubin S, Muller K Sci Data. 2025; 12(1):406.

PMID: 40057556 PMC: 11890765. DOI: 10.1038/s41597-025-04720-7.
