How Similar Are Similarity Searching Methods? A Principal Component Analysis of Molecular Descriptor Space

Overview

Journal J Chem Inf Model

Publisher American Chemical Society

Specialties Chemistry, Medical Informatics

Date 2009 Jan 7

PMID 19123924

Citations 85

Authors

Andreas Bender

Jeremy L Jenkins

Josef Scheiber

Sai Chetan K Sukuru

Meir Glick

John W Davies

Abstract

Different molecular descriptors capture different aspects of molecular structures, but this effect has not yet been quantified systematically on a large scale. In this work, we calculate the similarity of 37 descriptors by repeatedly selecting query compounds and ranking the rest of the database. Euclidean distances between the rank-orderings produced by different descriptors are calculated to determine descriptor (as opposed to compound) similarity, followed by PCA for visualization. Four broad descriptor classes are identified: circular fingerprints; circular fingerprints considering counts; path-based and keyed fingerprints; and pharmacophoric descriptors. Descriptor behavior is defined much more by those four classes than by the particular parametrization. Using counts instead of the presence/absence of fingerprints significantly changes descriptor behavior, which is crucial for the performance of topological autocorrelation vectors, but not of circular fingerprints. Four-point pharmacophores (piDAPH4) surprisingly lead to much higher retrieval rates than three-point pharmacophores (28.21% vs 19.15%) but still to a similar rank-ordering of compounds (retrieval of similar actives). Looking into individual rankings, circular fingerprints seem more appropriate than path-based fingerprints if complex ring systems or branching patterns are present; count-based fingerprints could be more suitable in databases with a large number of repeated subunits (amide bonds, sugar rings, terpenes). Information-based selection of diverse fingerprints for consensus scoring (ECFP4/TGD fingerprints) led only to marginal improvement over single-fingerprint results. While it seems to be nontrivial to exploit orthogonal descriptor behavior to improve retrieval rates in consensus virtual screening, those descriptors still each retrieve different actives, which corroborates the strategy of employing diverse descriptors individually in prospective virtual screening settings.
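The descriptor-comparison procedure outlined in the abstract (rank the database per query with each descriptor, take Euclidean distances between the resulting rank-orderings, then apply PCA) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The similarity scores are synthetic toy data, the function names are hypothetical, and the choice to run PCA on the concatenated rank vectors (rather than on the distance matrix) is an assumption made here for illustration.

```python
import numpy as np

def rank_vector(scores):
    """Return the rank of each compound (0 = most similar to the query)."""
    order = np.argsort(-scores, kind="stable")
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(scores))
    return ranks

def descriptor_rank_matrix(score_table):
    """score_table has shape (n_descriptors, n_queries, n_compounds).
    Returns one row per descriptor: its rank vectors concatenated over queries."""
    n_desc = score_table.shape[0]
    ranks = np.array([[rank_vector(q) for q in desc] for desc in score_table])
    return ranks.reshape(n_desc, -1).astype(float)

def pairwise_euclidean(rank_matrix):
    """Euclidean distance between the rank-orderings of every descriptor pair."""
    diff = rank_matrix[:, None, :] - rank_matrix[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def pca_project(rank_matrix, n_components=2):
    """Project descriptors onto their leading principal components via SVD."""
    centered = rank_matrix - rank_matrix.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

# Toy data (assumption): 4 hypothetical descriptors, 3 queries, 50 compounds.
# Descriptors 0/1 and 2/3 are near-duplicates, mimicking two descriptor classes.
rng = np.random.default_rng(0)
base_a = rng.random((3, 50))
base_b = rng.random((3, 50))
scores = np.stack([
    base_a,
    base_a + 0.01 * rng.random((3, 50)),
    base_b,
    base_b + 0.01 * rng.random((3, 50)),
])

ranks = descriptor_rank_matrix(scores)
dist = pairwise_euclidean(ranks)   # descriptor-by-descriptor distance matrix
coords = pca_project(ranks)        # 2D coordinates for visualization
```

In this toy setup, descriptors that rank compounds almost identically end up close together in `dist` and in the PCA coordinates, which is the kind of grouping the abstract reports for its four descriptor classes.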
