Systematic Artifacts in Support Vector Regression-based Compound Potency Prediction Revealed by Statistical and Activity Landscape Analysis

Overview
Journal PLoS One
Date 2015 Mar 6
PMID 25742011
Citations 10
Abstract

Support vector machines are a popular machine learning method for many classification tasks in biology and chemistry. In addition, the support vector regression (SVR) variant is widely used for numerical property predictions. In chemoinformatics and pharmaceutical research, SVR has probably become the most popular approach for modeling non-linear structure-activity relationships (SARs) and predicting compound potency values. Herein, we have systematically generated and analyzed SVR prediction models for a variety of compound data sets with different SAR characteristics. Although these SVR models were accurate on the basis of global prediction statistics and not prone to overfitting, they consistently mispredicted highly potent compounds. Hence, SVR prediction models displayed clear limitations in regions of local SAR discontinuity. Compared to the observed activity landscapes of the compound data sets, landscapes generated from SVR potency predictions were partly flattened, and activity cliff information was lost. Taken together, these findings have implications for practical SVR applications. In particular, prospective SVR-based potency predictions should be treated with caution because artificially low predictions are very likely for highly potent candidate compounds, which are the most important prediction targets.
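The contrast the abstract draws between accurate global statistics and mispredicted potent compounds can be made concrete with a small sketch. The code below is not the authors' analysis. It is a minimal illustration, assuming potency is expressed on a logarithmic scale where higher means more potent, and it uses a hypothetical helper name (potency_error_profile), a hypothetical "top fraction" cutoff, and made-up toy values.

```python
from statistics import mean


def potency_error_profile(observed, predicted, top_fraction=0.1):
    """Compare the global error with the error on the most potent compounds.

    observed / predicted: equal-length sequences of potency values
    (higher = more potent). top_fraction is an assumed cutoff for
    "highly potent"; the abstract does not specify one.
    """
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty sequences of equal length")

    # Rank compounds by observed potency, most potent first.
    pairs = sorted(zip(observed, predicted), key=lambda p: p[0], reverse=True)
    n_top = max(1, round(len(pairs) * top_fraction))
    top = pairs[:n_top]

    return {
        # Mean absolute error over all compounds.
        "global_mae": mean(abs(p - o) for o, p in pairs),
        # Mean absolute error over the most potent subset only.
        "top_mae": mean(abs(p - o) for o, p in top),
        # Mean signed error; negative means the subset is underpredicted.
        "top_bias": mean(p - o for o, p in top),
    }


# Made-up toy values for illustration only (not data from the paper).
observed = [5.0, 5.5, 6.0, 6.0, 6.5, 6.5, 7.0, 7.0, 7.5, 10.0]
predicted = [5.0, 5.5, 6.0, 6.0, 6.5, 6.5, 7.0, 7.0, 7.5, 8.0]

profile = potency_error_profile(observed, predicted)
print(profile)
```

In this toy case the global mean absolute error is 0.2 log units, which looks acceptable, while the single most potent compound is underpredicted by 2 log units. Reporting a signed error for the potent subset alongside the global statistic is one simple way to surface the kind of systematic underprediction the abstract describes.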

Citing Articles

Developing an advanced prediction model for new employee turnover intention utilizing machine learning techniques.

Park J, Feng Y, Jeong S. Sci Rep. 2024; 14(1):1221.

PMID: 38216616 PMC: 10786846. DOI: 10.1038/s41598-023-50593-4.


Using Machine Learning Algorithms to Pool Data from Meta-Analysis for the Prediction of Countermovement Jump Improvement.

Ho I, Weldon A, Yong J, Lam C, Sampaio J. Int J Environ Res Public Health. 2023; 20(10).

PMID: 37239607 PMC: 10218708. DOI: 10.3390/ijerph20105881.


Large-scale evaluation of k-fold cross-validation ensembles for uncertainty estimation.

Dutschmann T, Kinzel L, Ter Laak A, Baumann K. J Cheminform. 2023; 15(1):49.

PMID: 37118768 PMC: 10142532. DOI: 10.1186/s13321-023-00709-9.


Predicting Potent Compounds Using a Conditional Variational Autoencoder Based upon a New Structure-Potency Fingerprint.

Janela T, Takeuchi K, Bajorath J. Biomolecules. 2023; 13(2).

PMID: 36830761 PMC: 9953226. DOI: 10.3390/biom13020393.


Trajectory tracking of changes digital divide prediction factors in the elderly through machine learning.

Park J, Feng Y. PLoS One. 2023; 18(2):e0281291.

PMID: 36763570 PMC: 9916605. DOI: 10.1371/journal.pone.0281291.

