ORDerly: Data Sets and Benchmarks for Chemical Reaction Data

Overview
Date 2024 Apr 22
PMID 38648077
Abstract

Machine learning has the potential to deliver tremendous value to the life sciences through models that aid the discovery of new molecules and reduce the time for new products to come to market. Chemical reactions play a significant role in these fields, but there is a lack of high-quality open-source chemical reaction data sets for training machine learning models. Herein, we present ORDerly, an open-source Python package for the customizable and reproducible preparation of reaction data stored in accordance with the increasingly popular Open Reaction Database (ORD) schema. We use ORDerly to clean United States patent data stored in ORD and generate data sets for forward prediction and retrosynthesis, as well as the first benchmark for reaction condition prediction. We train neural networks for condition prediction on data sets generated with ORDerly and show that data sets missing key cleaning steps can lead to silently inflated performance metrics. Additionally, we train transformers for forward prediction and retrosynthesis and demonstrate how non-patent data can be used to evaluate model generalization. By providing a customizable open-source solution for cleaning and preparing large chemical reaction data sets, ORDerly is poised to push forward the boundaries of machine learning applications in chemistry.

Citing Articles

Machine learning-guided strategies for reaction conditions design and optimization.

Chen L, Li Y. Beilstein J Org Chem. 2024; 20:2476-2492.

PMID: 39376489 PMC: 11457048. DOI: 10.3762/bjoc.20.212.


Catalysing (organo-)catalysis: Trends in the application of machine learning to enantioselective organocatalysis.

Schmid S, Schlosser L, Glorius F, Jorner K. Beilstein J Org Chem. 2024; 20:2280-2304.

PMID: 39290209 PMC: 11406055. DOI: 10.3762/bjoc.20.196.
