
Data Generation for Machine Learning Interatomic Potentials and Beyond

Overview
Journal: Chem Rev
Specialty: Chemistry
Date: 2024 Nov 21
PMID: 39572011
Abstract

The field of data-driven chemistry is evolving rapidly, driven by innovations in machine learning models for predicting molecular properties and behavior. Recent strides in machine learning interatomic potentials (MLIPs) have paved the way for accurate modeling of diverse chemical and structural properties at the atomic level. The key determinant of MLIP reliability remains the quality of the training data. A paramount challenge lies in constructing training sets that capture specific domains in the vast chemical and structural space. This Review surveys the essential components and integrity of training data that ensure the extensibility and transferability of the resulting models. We delve into the details of active learning, discussing its various facets and implementations. We outline different types of uncertainty quantification applied to atomistic data acquisition and the correlations between estimated uncertainty and true error. The role of atomistic data samplers in generating diverse and informative structures is highlighted. Furthermore, we discuss data acquisition via modified and surrogate potential energy surfaces as an innovative approach to diversify training data. The Review also provides a list of publicly available data sets that cover essential domains of chemical space.
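
To make the active learning and uncertainty quantification ideas in the abstract concrete, the following is a minimal sketch and not a method taken from the Review. It assumes a toy one-dimensional Morse-like curve as a stand-in for an expensive reference calculation, a committee of piecewise-linear interpolators trained on random subsets of the data as a stand-in for an MLIP ensemble, and the committee standard deviation as the uncertainty estimate. All function names and parameter values are hypothetical and chosen for illustration only.

```python
import math
import random


def reference_energy(r):
    """Toy 'ground truth' label: a Morse-like pair energy (assumed stand-in for a reference calculation)."""
    return (1.0 - math.exp(-1.5 * (r - 1.0))) ** 2


def interpolate(points, r):
    """Piecewise-linear model through (r, E) points; clamps outside the sampled range."""
    pts = sorted(points)
    if r <= pts[0][0]:
        return pts[0][1]
    if r >= pts[-1][0]:
        return pts[-1][1]
    for (r0, e0), (r1, e1) in zip(pts, pts[1:]):
        if r0 <= r <= r1:
            t = (r - r0) / (r1 - r0)
            return e0 + t * (e1 - e0)


def build_committee(train, n_models, rng):
    """Each committee member sees a random subset (about two thirds) of the training set."""
    k = max(2, (2 * len(train)) // 3)
    return [rng.sample(train, k) for _ in range(n_models)]


def predict(committee, r):
    """Return the committee mean and standard deviation (the uncertainty estimate)."""
    preds = [interpolate(member, r) for member in committee]
    mean = sum(preds) / len(preds)
    std = math.sqrt(sum((p - mean) ** 2 for p in preds) / len(preds))
    return mean, std


def active_learning(n_rounds=10, n_models=8, seed=0):
    """Repeatedly label the pool structure on which the committee disagrees most."""
    rng = random.Random(seed)
    train = [(r, reference_energy(r)) for r in (0.7, 1.1, 1.9)]
    pool = [0.625 + 0.05 * i for i in range(30)]  # unlabeled candidate geometries
    history = []
    for _ in range(n_rounds):
        committee = build_committee(train, n_models, rng)
        sigma, r_new = max((predict(committee, r)[1], r) for r in pool)
        train.append((r_new, reference_energy(r_new)))  # "run the reference calculation"
        pool.remove(r_new)
        history.append((r_new, sigma))
    return train, pool, history


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def uncertainty_error_correlation(train, pool, n_models=8, seed=1):
    """Correlate estimated uncertainty with the true absolute error on the remaining pool."""
    committee = build_committee(train, n_models, random.Random(seed))
    sigmas, errors = [], []
    for r in pool:
        mean, std = predict(committee, r)
        sigmas.append(std)
        errors.append(abs(mean - reference_energy(r)))
    return pearson(sigmas, errors)


if __name__ == "__main__":
    train, pool, history = active_learning()
    print("selected geometries:", [round(r, 3) for r, _ in history])
    print("uncertainty/error correlation:", uncertainty_error_correlation(train, pool))
```

The design choice worth noting is that selection uses only the committee disagreement, never the true error, which is unavailable for unlabeled structures in practice. The final correlation check is the kind of diagnostic the abstract alludes to when it mentions comparing estimated uncertainty with true error; in this toy setting its value depends on the random seed and should not be read as representative of real MLIP ensembles.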

Citing Articles

Computational tools for the prediction of site- and regioselectivity of organic reactions.

Sigmund L, Assante M, Johansson M, Norrby P, Jorner K, Kabeshov M Chem Sci. 2025; .

PMID: 40070469 PMC: 11891785. DOI: 10.1039/d5sc00541h.
