Effect of Data Quality and Data Quantity on the Estimation of Intrinsic Solubility: Analysis Based on a Single-Source Data Set

Overview

Journal Mol Pharm

Publisher American Chemical Society

Specialty Pharmacology

Date 2024 Sep 13

PMID 39267585

Authors

Jiaxi Zhao

Eline Hermans

Kia Sepassi

Christophe Tistaert

Christel A S Bergstrom

Mazen Ahmad

Per Larsson

Abstract

Aqueous solubility is one of the most important physicochemical properties of drug molecules and a major driving force for oral drug absorption. To date, the performance of in silico models for estimating solubility in novel chemical space remains limited. To investigate possible reasons and remedies, the Johnson and Johnson in-house aqueous solubility data set, covering over 40,000 compounds, was leveraged. All data were generated with the same high-throughput assay, providing a unique opportunity to explore the relationship between data quality, data quantity, and model estimations. Six intrinsic solubility data sets with different sizes and noise levels were generated using three different approaches: (i) inclusion or exclusion of amorphous solid residue, (ii) measured or experimental log  to identify the intrinsic solubility, and (iii) adopting or omitting a quality check process in the data processing workflow. A random forest regressor was trained on the data sets with three different sets of descriptors, calculated with RDKit, ADMET Predictor, or Mordred, and performance was evaluated with nested cross-validation as well as ten refined test sets. The models confirm, as expected, that for the same data set size, high-quality data lead to better model performance. However, models trained on larger data sets containing analytical variability can give estimations as accurate as those from models trained on small, clean, and diverse data sets. In contrast, the noise introduced by including measurements with amorphous solid present after the solubility measurement cannot be overcome by increasing the data set size, because it introduces a systematic positive error into the data set, which confirms the importance of critical data review. Finally, two top-performing models were tested on the first test set from the second solubility challenge, achieving RMSE values of 0.74 and 0.72 and log S ± 0.5 of 46 and 48%, respectively. These results show improved performance compared with those reported in the findings of the competition, highlighting that a single-source curated data set can enhance the prediction of intrinsic solubility.
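The two figures reported for the solubility challenge test set can be illustrated with a minimal sketch. It assumes that solubility is expressed in log units and that the second figure is the percentage of predictions falling within ±0.5 log units of the measured value; that reading, the function names, and the example values are assumptions made here for illustration and are not taken from the study.

```python
import math
from typing import Sequence


def rmse(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Root-mean-square error between observed and predicted log solubility."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def percent_within(
    observed: Sequence[float],
    predicted: Sequence[float],
    tolerance: float = 0.5,
) -> float:
    """Percentage of predictions within +/- tolerance log units of the observation."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for o, p in zip(observed, predicted) if abs(o - p) <= tolerance)
    return 100.0 * hits / len(observed)


# Hypothetical log solubility values, for illustration only (not study data).
observed_log_s = [-3.0, -4.5, -2.0, -5.0]
predicted_log_s = [-3.25, -3.5, -2.0, -5.75]

print(f"RMSE: {rmse(observed_log_s, predicted_log_s):.2f}")
print(f"Within +/-0.5 log units: {percent_within(observed_log_s, predicted_log_s):.0f}%")
```

The two metrics are complementary: RMSE penalizes large individual errors heavily, while the percentage within a tolerance shows how many estimates are close enough to be useful regardless of how far off the remaining ones are.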
