Finding an Appropriate Equation to Measure Similarity Between Binary Vectors: Case Studies on Indonesian and Japanese Herbal Medicines

Overview

Journal: BMC Bioinformatics

Publisher: BioMed Central

Specialty: Biology

Date: 2016 Dec 9

PMID: 27927171

Citations: 3

Authors

Sony Hartono Wijaya

Farit Mochamad Afendi

Irmanida Batubara

Latifah K Darusman

Md Altaf-Ul-Amin

Shigehiko Kanaya

Abstract

Background: Binary similarity and dissimilarity measures play a critical role in processing data that consist of binary vectors, in fields including bioinformatics and chemometrics. These measures express the similarity or dissimilarity between two binary vectors in terms of positive matches, absence mismatches, and negative matches. To our knowledge, no published work presents a systematic way of finding a binary similarity equation that performs well for a given data type or application. A principled method for selecting a suitable binary similarity or dissimilarity measure is needed to obtain better classification results.
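
As an illustration of the matching quantities mentioned above, the following minimal Python sketch counts the four cells of the 2x2 table for two binary vectors and computes the Jaccard coefficient from them. The function names and the example vectors are hypothetical, the labelling of the two mismatch cells as b and c is an assumed convention, and the code is not taken from the bmeasures package.

```python
from typing import Sequence, Tuple


def otu_counts(x: Sequence[int], y: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (a, b, c, d) for two equal-length 0/1 vectors.

    Assumed convention: a = present in both, b = present only in x,
    c = present only in y, d = absent from both (negative match).
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    pairs = list(zip(x, y))
    a = sum(1 for i, j in pairs if i == 1 and j == 1)
    b = sum(1 for i, j in pairs if i == 1 and j == 0)
    c = sum(1 for i, j in pairs if i == 0 and j == 1)
    d = sum(1 for i, j in pairs if i == 0 and j == 0)
    return a, b, c, d


def jaccard(a: int, b: int, c: int, d: int) -> float:
    """Jaccard similarity a / (a + b + c); d is accepted but ignored."""
    denom = a + b + c
    return a / denom if denom else 0.0


# Hypothetical formulas encoded as presence (1) / absence (0) of 8 plants.
formula_1 = [1, 1, 0, 0, 1, 0, 0, 0]
formula_2 = [1, 0, 0, 0, 1, 1, 0, 0]

counts = otu_counts(formula_1, formula_2)
similarity = jaccard(*counts)
print(counts, similarity)
```

Every other measure built on these quantities can take the same four counts as input, so the counting step only needs to be written once.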

Results: In this study, we proposed a novel approach to selecting binary similarity and dissimilarity measures. We collected 79 binary similarity and dissimilarity equations through an extensive literature search and implemented them as an R package called bmeasures. We applied these measures to quantify the similarity and dissimilarity between herbal medicine formulas, treating the Indonesian Jamu and Japanese Kampo datasets separately. Using Receiver Operating Characteristic (ROC) curve analysis, we assessed how well each equation classifies herbal medicine pairs as having matching or mismatching efficacies based on their similarity or dissimilarity coefficients. According to the area under the ROC curve, the Jamu and Kampo datasets yielded different rankings of the binary similarity and dissimilarity measures. Of all the equations, the Forbes-2 similarity measure is recommended for studying relationships between Jamu formulas, and the Variant of Correlation similarity measure for relationships between Kampo formulas.
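
The ROC-based ranking described above can be sketched as follows. This is a minimal Python illustration, not the authors' R implementation: it computes the area under the ROC curve through its rank interpretation (the probability that a randomly chosen matching pair scores higher than a randomly chosen mismatching pair, with ties counted as one half). The function name, the labels, and the scores are all hypothetical.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC via pairwise comparison of positive and negative scores."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(pos) * len(neg))


# Hypothetical data: 1 = formula pair with matching efficacy, 0 = mismatch.
labels = [1, 1, 1, 0, 0, 0]
# Hypothetical similarity coefficients from two candidate measures.
measure_a = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
measure_b = [0.6, 0.5, 0.4, 0.7, 0.5, 0.1]

auc_a = roc_auc(measure_a, labels)
auc_b = roc_auc(measure_b, labels)
print(auc_a, auc_b)
```

Under this sketch, the measure with the larger AUC would be ranked higher for the dataset at hand. For a dissimilarity measure, the scores would need to be negated (or otherwise reversed) before the comparison so that larger values still indicate a likelier match.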

Conclusions: The selection of binary similarity and dissimilarity measures for multivariate analysis is data dependent. The proposed method can be used to find the most suitable binary similarity or dissimilarity equation for a particular dataset. Our findings suggest that all four types of matching quantities in the Operational Taxonomic Unit (OTU) table are important for calculating similarity and dissimilarity coefficients between herbal medicine formulas. In addition, measures that include the negative match quantity d separate herbal medicine pairs better than equations that exclude d.
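
To make the role of the negative match quantity d concrete, the short Python sketch below compares one measure that ignores d (Jaccard) with one that includes it (simple matching). The two measures and the counts are chosen only for illustration and are hypothetical; the sketch shows how d enters a coefficient and does not reproduce the study's results.

```python
def jaccard(a: int, b: int, c: int, d: int) -> float:
    """Jaccard similarity a / (a + b + c); excludes negative matches."""
    denom = a + b + c
    return a / denom if denom else 0.0


def simple_matching(a: int, b: int, c: int, d: int) -> float:
    """Simple matching (a + d) / (a + b + c + d); includes negative matches."""
    n = a + b + c + d
    return (a + d) / n if n else 0.0


# Two hypothetical formula pairs with identical a, b, c but different d.
few_shared_absences = (2, 1, 1, 0)
many_shared_absences = (2, 1, 1, 96)

for counts in (few_shared_absences, many_shared_absences):
    print(counts, jaccard(*counts), simple_matching(*counts))
```

Jaccard returns the same value for both pairs, whereas simple matching distinguishes them, which is the kind of extra information a measure gains by including d.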

Citing Articles

PPNet: Identifying Functional Association Networks by Phylogenetic Profiling of Prokaryotic Genomes.

Li Y, Ma B, Hua K, Gong H, He R, Luo R. Microbiol Spectr. 2023; 11(1):e0387122.

PMID: 36602356 PMC: 9927313. DOI: 10.1128/spectrum.03871-22.

Comparative Analysis of Binary Similarity Measures for Compound Identification in Mass Spectrometry-Based Metabolomics.

Kim S, Kato I, Zhang X. Metabolites. 2022; 12(8).

PMID: 35893261 PMC: 9394311. DOI: 10.3390/metabo12080694.

A comparison of 71 binary similarity coefficients: The effect of base rates.

Brusco M, Cradit J, Steinley D. PLoS One. 2021; 16(4):e0247751.

PMID: 33826612 PMC: 8026075. DOI: 10.1371/journal.pone.0247751.
