Distance Based Algorithms for Small Biomolecule Classification and Structural Similarity Search

Overview

Journal: Bioinformatics

Publisher: Oxford University Press

Specialty: Biology

Date: 2006 Jul 29

PMID: 16873478

Citations: 6

Authors

Emre Karakoc

Artem Cherkasov

S Cenk Sahinalp

Abstract

Motivation: Structural similarity search among small molecules is a standard tool in molecular classification and in silico drug discovery. Its effectiveness depends on how well two problems are addressed. First, the notion of similarity should be chosen to provide the highest level of discrimination between compounds with respect to the bioactivity of interest. Second, the data structure used for the search should be very efficient, because the molecular databases of interest include several million compounds.

Results: In this paper we focus on the k-nearest-neighbor (k-nn) search method, which until recently was not considered for small molecule classification. The few recent applications of k-nn to compound classification focus on selecting the most relevant set of chemical descriptors, which are then compared under the standard Minkowski distance L(p). Here we show how to computationally design the optimal weighted Minkowski distance wL(p) for maximizing the discrimination between active and inactive compounds with respect to the bioactivities of interest. We then show how to construct pruning-based k-nn search data structures for any wL(p) distance that minimize similarity search time. The accuracy achieved by our classifier is better than that of the alternative LDA and MLR approaches and is comparable to that of the ANN methods. In terms of running time, our classifier is considerably faster than the ANN approach, especially when large data sets are used. Furthermore, our classifier quantifies the level of bioactivity rather than returning a binary decision and is thus more informative than the ANN approach.
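
The ideas in the Results paragraph can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that wL(p) has the usual form (sum_i w_i |x_i - y_i|^p)^(1/p) with non-negative weights and p >= 1, that the "level of bioactivity" can be approximated by the fraction of active compounds among the k nearest neighbors, and that pruning can be done with a single pivot and the triangle inequality. The abstract does not describe the actual data structure or how the weights are optimized, so the function names, the toy descriptors, and the weights below are all hypothetical.

```python
import heapq


def weighted_minkowski(x, y, w, p=2.0):
    """Assumed form of wL(p): (sum_i w_i * |x_i - y_i|**p) ** (1/p)."""
    return sum(wi * abs(xi - yi) ** p for xi, yi, wi in zip(x, y, w)) ** (1.0 / p)


def knn_activity_score(query, compounds, labels, w, p=2.0, k=3):
    """Fraction of active (label 1) compounds among the k nearest neighbors.

    This graded score is one possible way to quantify bioactivity rather
    than return a binary decision; it is an assumption of this sketch.
    """
    dists = [(weighted_minkowski(query, c, w, p), lab)
             for c, lab in zip(compounds, labels)]
    nearest = heapq.nsmallest(k, dists, key=lambda t: t[0])
    return sum(lab for _, lab in nearest) / len(nearest)


def knn_with_pivot_pruning(query, compounds, w, p=2.0, k=3, pivot_index=0):
    """k-nn search that skips candidates using the triangle inequality.

    For any pivot v, |d(q, v) - d(v, c)| <= d(q, c), so a candidate whose
    lower bound already reaches the current k-th best distance cannot
    improve the result. Returns (sorted [(distance, index)], evaluations).
    """
    pivot = compounds[pivot_index]
    # In a real index these pivot distances would be precomputed once.
    pivot_d = [weighted_minkowski(pivot, c, w, p) for c in compounds]
    dq = weighted_minkowski(query, pivot, w, p)
    # Visit candidates in order of increasing lower bound.
    order = sorted(range(len(compounds)), key=lambda i: abs(dq - pivot_d[i]))
    best = []  # max-heap of the k best so far, stored as (-distance, index)
    evaluated = 0
    for i in order:
        lower_bound = abs(dq - pivot_d[i])
        if len(best) == k and lower_bound >= -best[0][0]:
            break  # every remaining candidate has an equal or larger bound
        d = weighted_minkowski(query, compounds[i], w, p)
        evaluated += 1
        if len(best) < k:
            heapq.heappush(best, (-d, i))
        elif d < -best[0][0]:
            heapq.heapreplace(best, (-d, i))
    return sorted((-nd, i) for nd, i in best), evaluated


# Toy data: two hypothetical descriptors per compound, 1 = active.
compounds = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
labels = [1, 1, 0, 0, 0]
weights = (2.0, 0.5)  # illustrative weights, not optimized values
query = (0.9, 0.1)

score = knn_activity_score(query, compounds, labels, weights, p=2.0, k=3)
neighbors, evaluated = knn_with_pivot_pruning(query, compounds, weights, p=2.0, k=3)
```

On this toy data the pruned search evaluates fewer distances than a full scan while returning the same three neighbors. A search structure for millions of compounds would use many pivots or a tree rather than the single pivot assumed here.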

Citing Articles

Machine Learning Study of Metabolic Networks ChEMBL Data of Antibacterial Compounds.

Dieguez-Santana K, Casanola-Martin G, Torres R, Rasulev B, Green J, Gonzalez-Diaz H. Mol Pharm. 2022; 19(7):2151-2163.

PMID: 35671399 PMC: 9986951. DOI: 10.1021/acs.molpharmaceut.2c00029.

Machine Learning in Antibacterial Drug Design.

Jukic M, Bren U. Front Pharmacol. 2022; 13:864412.

PMID: 35592425 PMC: 9110924. DOI: 10.3389/fphar.2022.864412.

Identification of Novel Antibacterials Using Machine Learning Techniques.

Ivanenkov Y, Zhavoronkov A, Yamidanov R, Osterman I, Sergiev P, Aladinskiy V. Front Pharmacol. 2019; 10:913.

PMID: 31507413 PMC: 6719509. DOI: 10.3389/fphar.2019.00913.

Phosphoproteomic analyses reveal novel cross-modulation mechanisms between two signaling pathways in yeast.

Vaga S, Bernardo-Faura M, Cokelaer T, Maiolica A, Barnes C, Gillet L. Mol Syst Biol. 2014; 10:767.

PMID: 25492886 PMC: 4300490. DOI: 10.15252/msb.20145112.

Effect of training data size and noise level on support vector machines virtual screening of genotoxic compounds from large compound libraries.

Kumar P, Ma X, Liu X, Jia J, Bucong H, Xue Y. J Comput Aided Mol Des. 2011; 25(5):455-67.

PMID: 21556903 DOI: 10.1007/s10822-011-9431-3.