
Finding an Appropriate Equation to Measure Similarity Between Binary Vectors: Case Studies on Indonesian and Japanese Herbal Medicines

Overview
Publisher Biomed Central
Specialty Biology
Date 2016 Dec 9
PMID 27927171
Citations 3
Abstract

Background: Binary similarity and dissimilarity measures play critical roles in processing data that consist of binary vectors in various fields, including bioinformatics and chemometrics. These measures express the similarity or dissimilarity between two binary vectors in terms of positive matches, absence mismatches, and negative matches. To our knowledge, no published work presents a systematic way of finding an appropriate binary similarity equation that performs well for a given data type or application. A proper method for selecting a suitable binary similarity or dissimilarity measure is needed to obtain better classification results.
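As a rough illustration of the matching quantities mentioned above, the sketch below counts them for two binary vectors and evaluates two widely known coefficients: Jaccard, which ignores negative matches, and simple matching (Sokal-Michener), which includes them. The function names and the two example vectors are hypothetical and chosen only for illustration; this is not the code of the bmeasures package.

```python
from typing import Sequence, Tuple


def otu_counts(x: Sequence[int], y: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (a, b, c, d) for two equal-length 0/1 vectors.

    a: positions where both vectors are 1 (positive matches)
    b: positions where x is 1 and y is 0 (mismatch)
    c: positions where x is 0 and y is 1 (mismatch)
    d: positions where both vectors are 0 (negative matches)
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    a = sum(1 for i, j in zip(x, y) if i == 1 and j == 1)
    b = sum(1 for i, j in zip(x, y) if i == 1 and j == 0)
    c = sum(1 for i, j in zip(x, y) if i == 0 and j == 1)
    d = sum(1 for i, j in zip(x, y) if i == 0 and j == 0)
    return a, b, c, d


def jaccard(x: Sequence[int], y: Sequence[int]) -> float:
    """Jaccard similarity a / (a + b + c); excludes negative matches d."""
    a, b, c, _ = otu_counts(x, y)
    return a / (a + b + c) if (a + b + c) else 0.0


def simple_matching(x: Sequence[int], y: Sequence[int]) -> float:
    """Simple matching similarity (a + d) / (a + b + c + d); includes d."""
    a, b, c, d = otu_counts(x, y)
    n = a + b + c + d
    return (a + d) / n if n else 0.0


# Hypothetical ingredient-presence vectors for two formulas (illustrative only).
formula_1 = [1, 1, 0, 0, 1, 0]
formula_2 = [1, 0, 0, 0, 1, 1]

print(otu_counts(formula_1, formula_2))
print(jaccard(formula_1, formula_2))
print(simple_matching(formula_1, formula_2))
```

The two coefficients differ only in how they treat shared absences, which is the kind of design difference that distinguishes many binary measures from one another.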

Results: In this study, we proposed a novel approach to selecting binary similarity and dissimilarity measures. We collected 79 binary similarity and dissimilarity equations through an extensive literature search and implemented them as an R package called bmeasures. We applied these measures to quantify the similarity and dissimilarity between herbal medicine formulas, treating Indonesian Jamu and Japanese Kampo as separate datasets. Using Receiver Operating Characteristic (ROC) curve analysis, we assessed how well each equation classifies herbal medicine pairs as having matching or mismatching efficacies based on their similarity or dissimilarity coefficients. According to the area under the ROC curve, the Indonesian Jamu and Japanese Kampo datasets yielded different rankings of the binary similarity and dissimilarity measures. Of all the equations, the Forbes-2 similarity measure is recommended for studying the relationship between Jamu formulas, and the Variant of Correlation similarity measure for Kampo formulas.
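The ranking step described above can be sketched in a few lines, under stated assumptions. The sketch computes the area under the ROC curve as the probability that a randomly chosen matching pair receives a higher similarity score than a randomly chosen mismatching pair (ties count one half), then sorts candidate measures by that value. The toy pairs, labels, and helper names are hypothetical, only two of the many measures are included, and the resulting order says nothing about the Jamu or Kampo results.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[int]
Measure = Callable[[Vector, Vector], float]


def counts(x: Vector, y: Vector) -> Tuple[int, int, int, int]:
    """Return the matching quantities (a, b, c, d) of two 0/1 vectors."""
    a = sum(1 for i, j in zip(x, y) if i == 1 and j == 1)
    b = sum(1 for i, j in zip(x, y) if i == 1 and j == 0)
    c = sum(1 for i, j in zip(x, y) if i == 0 and j == 1)
    d = sum(1 for i, j in zip(x, y) if i == 0 and j == 0)
    return a, b, c, d


def jaccard(x: Vector, y: Vector) -> float:
    a, b, c, _ = counts(x, y)
    return a / (a + b + c) if (a + b + c) else 0.0


def simple_matching(x: Vector, y: Vector) -> float:
    a, b, c, d = counts(x, y)
    return (a + d) / (a + b + c + d)


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(pos) * len(neg))


# Hypothetical formula pairs: (vector_1, vector_2, label), where label 1
# means the two formulas share an efficacy and 0 means they do not.
pairs: List[Tuple[Vector, Vector, int]] = [
    ([1, 1, 0, 0], [1, 1, 0, 0], 1),
    ([1, 0, 0, 0], [1, 0, 0, 1], 1),
    ([1, 1, 0, 0], [0, 0, 1, 1], 0),
    ([1, 1, 1, 0], [1, 1, 0, 1], 0),
]
labels = [lab for _, _, lab in pairs]

measures: Dict[str, Measure] = {
    "jaccard": jaccard,
    "simple_matching": simple_matching,
}

# Score every pair with every measure, then rank measures by AUC.
ranking = sorted(
    ((name, auc([m(x, y) for x, y, _ in pairs], labels))
     for name, m in measures.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, value in ranking:
    print(f"{name}: AUC = {value:.3f}")
```

For a dissimilarity measure, the scores would need to be negated (or the labels flipped) before computing the AUC, since larger values then indicate less similar pairs.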

Conclusions: The selection of binary similarity and dissimilarity measures for multivariate analysis is data dependent. The proposed method can be used to find the most suitable binary similarity or dissimilarity equation for a particular dataset. Our findings suggest that all four types of matching quantities in the Operational Taxonomic Unit (OTU) table are important for calculating the similarity and dissimilarity coefficients between herbal medicine formulas. In addition, binary similarity and dissimilarity measures that include the negative match quantity d separate herbal medicine pairs better than equations that exclude d.

Citing Articles

PPNet: Identifying Functional Association Networks by Phylogenetic Profiling of Prokaryotic Genomes.

Li Y, Ma B, Hua K, Gong H, He R, Luo R. Microbiol Spectr. 2023; 11(1):e0387122.

PMID: 36602356 PMC: 9927313. DOI: 10.1128/spectrum.03871-22.


Comparative Analysis of Binary Similarity Measures for Compound Identification in Mass Spectrometry-Based Metabolomics.

Kim S, Kato I, Zhang X. Metabolites. 2022; 12(8).

PMID: 35893261 PMC: 9394311. DOI: 10.3390/metabo12080694.


A comparison of 71 binary similarity coefficients: The effect of base rates.

Brusco M, Cradit J, Steinley D. PLoS One. 2021; 16(4):e0247751.

PMID: 33826612 PMC: 8026075. DOI: 10.1371/journal.pone.0247751.
