Y-Randomization and Its Variants in QSPR/QSAR

Overview

Journal J Chem Inf Model

Publisher American Chemical Society

Specialties Chemistry
Medical Informatics

Date 2007 Sep 21

PMID 17880194

Citations 164

Authors

Christoph Rücker

Gerta Rücker

Markus Meringer


Abstract

y-Randomization is a tool used in the validation of QSPR/QSAR models, whereby the performance of the original model in data description (r²) is compared to that of models built for a permuted (randomly shuffled) response, based on the original descriptor pool and the original model building procedure. We compared y-randomization and several variants thereof, using original response, permuted response, or random number pseudoresponse and original descriptors or random number pseudodescriptors, in the typical setting of multilinear regression (MLR) with descriptor selection. For each combination of number of observations (compounds), number of descriptors in the final model, and number of descriptors in the pool to select from, computer experiments using the same descriptor selection method result in two different mean highest random r² values. A lower one is produced by y-randomization or a variant likewise based on the original descriptors, while a higher one is obtained from variants that use random number pseudodescriptors. The difference is due to the intercorrelation of real descriptors in the pool. We propose to compare an original model's r² to both of these whenever possible. The meaning of the three possible outcomes of such a double test is discussed. Often y-randomization is not available to a potential user of a model, because the values of all descriptors in the pool for all compounds have not been published. In such cases random number experiments as proposed here are still possible. The test was applied to several recently published MLR QSAR equations, and cases of failure were identified. Some progress is also reported toward the aim of obtaining the mean highest r² of random pseudomodels by calculation rather than by tedious multiple simulations on random number variables.
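The double test described in the abstract can be illustrated with a minimal Python sketch. Several things here are assumptions and are not taken from the paper: greedy forward selection stands in for the descriptor selection method, pseudodescriptors are drawn from a standard normal distribution, and the function names (`r2_mlr`, `forward_select_r2`, `mean_highest_random_r2`) and the synthetic data are purely illustrative.

```python
import numpy as np

def r2_mlr(X, y):
    """r^2 of an ordinary least-squares fit of y on X, with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

def forward_select_r2(X, y, k):
    """Greedy forward selection of k descriptors (an assumed stand-in
    for the selection method); returns the r^2 of the final model."""
    chosen, best = [], 0.0
    for _ in range(k):
        remaining = [j for j in range(X.shape[1]) if j not in chosen]
        scores = [r2_mlr(X[:, chosen + [j]], y) for j in remaining]
        i = int(np.argmax(scores))
        chosen.append(remaining[i])
        best = scores[i]
    return best

def mean_highest_random_r2(X, y, k, mode, n_trials=100, seed=0):
    """Mean of the highest r^2 found over n_trials random pseudomodels.

    mode="permuted_y":         y-randomization (shuffled response,
                               original descriptors).
    mode="pseudo_descriptors": original response, random number
                               pseudodescriptors (Gaussian, assumed).
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    results = []
    for _ in range(n_trials):
        if mode == "permuted_y":
            Xt, yt = X, rng.permutation(y)
        elif mode == "pseudo_descriptors":
            Xt, yt = rng.standard_normal((n, m)), y
        else:
            raise ValueError(f"unknown mode: {mode}")
        results.append(forward_select_r2(Xt, yt, k))
    return float(np.mean(results))

# Illustrative synthetic data: 30 compounds, pool of 20 descriptors,
# 2 descriptors in the final model.
rng = np.random.default_rng(1)
X = rng.standard_normal((30, 20))
y = X[:, 0] - 0.5 * X[:, 3] + 0.3 * rng.standard_normal(30)

r2_original = forward_select_r2(X, y, k=2)
r2_y_rand = mean_highest_random_r2(X, y, 2, "permuted_y", n_trials=50)
r2_pseudo = mean_highest_random_r2(X, y, 2, "pseudo_descriptors", n_trials=50)
print(r2_original, r2_y_rand, r2_pseudo)
```

In this sketch a model would pass the double test when `r2_original` clearly exceeds both random reference values. The second mode needs only the number of compounds, the pool size, and the response values, which is why it remains usable when the original descriptor values are unpublished.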
