On Triangle Inequalities of Correlation-based Distances for Gene Expression Profiles

Overview
Publisher: BioMed Central
Specialty: Biology
Date: 2023 Feb 9
PMID: 36755234
Abstract

Background: Distance functions are fundamental for evaluating the differences between gene expression profiles. Such a function should output a low value if the profiles are strongly correlated, whether negatively or positively, and a high value otherwise. One popular distance function is the absolute correlation distance, [Formula: see text], where [Formula: see text] is a similarity measure, such as Pearson or Spearman correlation. However, the absolute correlation distance fails to fulfill the triangle inequality, a property that would guarantee better performance in vector quantization, allow fast data localization, and accelerate data clustering.
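
To make the triangle-inequality failure concrete, here is a minimal sketch in Python. It assumes the common definition of the absolute correlation distance, 1 - |ρ|, with ρ taken as Pearson correlation; the formula itself is elided in this text, so that definition is an assumption. The three profiles `x`, `y`, and `z` and the helper names are hypothetical and chosen only for illustration.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)

def abs_corr_distance(a, b):
    # Assumed definition (not confirmed by the text): 1 - |rho|.
    return 1.0 - abs(pearson(a, b))

# Hypothetical expression profiles: x and z are uncorrelated,
# while y = x + z is moderately correlated with both.
x = [1.0, -1.0, 1.0, -1.0]
z = [1.0, 1.0, -1.0, -1.0]
y = [2.0, 0.0, 0.0, -2.0]

direct = abs_corr_distance(x, z)                            # x -> z
detour = abs_corr_distance(x, y) + abs_corr_distance(y, z)  # x -> y -> z

print(f"direct = {direct:.4f}, detour = {detour:.4f}")
print("triangle inequality violated:", direct > detour)
```

For these profiles the direct distance is 1, while the detour through `y` sums to about 0.586, so the direct path is longer than the detour and the triangle inequality does not hold.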

Results: In this work, we propose [Formula: see text] as an alternative. We prove that [Formula: see text] satisfies the triangle inequality when [Formula: see text] represents Pearson correlation, Spearman correlation, or cosine similarity. We show [Formula: see text] to be better than [Formula: see text], another variant of [Formula: see text] that satisfies the triangle inequality, both analytically and experimentally. We empirically compared [Formula: see text] with [Formula: see text] in gene clustering and sample clustering experiments on real-world biological data. The two distances performed similarly in both gene clustering and sample clustering under hierarchical clustering and PAM (partitioning around medoids) clustering. However, [Formula: see text] demonstrated more robust clustering. According to the bootstrap experiment, [Formula: see text] generated more robust sample-pair partitions more frequently (P-value [Formula: see text]). The statistics on the number of times a class "dissolved" also support the advantage of [Formula: see text] in robustness.
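
A claim of this kind can be checked numerically on random data. The sketch below is illustrative only: the formulas are elided in this text, so it assumes 1 - |ρ| for the absolute correlation distance and, as a hypothetical stand-in for a triangle-inequality-satisfying variant, the square-root transform sqrt(1 - |ρ|). Neither assumption is confirmed by the text. The function names, the profile length, and the number of trials are likewise arbitrary choices.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)

def d_abs(a, b):
    # Assumed absolute correlation distance: 1 - |rho| (clamped at 0).
    return max(0.0, 1.0 - abs(pearson(a, b)))

def d_sqrt(a, b):
    # Hypothetical square-root variant: sqrt(1 - |rho|).
    return math.sqrt(d_abs(a, b))

def count_violations(dist, trials=5000, dim=4, seed=0, tol=1e-9):
    """Count random triples where some side exceeds the sum of the other two."""
    rng = random.Random(seed)
    violations = 0
    for _ in range(trials):
        x, y, z = ([rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(3))
        dxy, dyz, dxz = dist(x, y), dist(y, z), dist(x, z)
        if (dxz > dxy + dyz + tol or
                dxy > dxz + dyz + tol or
                dyz > dxy + dxz + tol):
            violations += 1
    return violations

violations_abs = count_violations(d_abs)
violations_sqrt = count_violations(d_sqrt)
print("violations with 1 - |rho|:      ", violations_abs)
print("violations with sqrt(1 - |rho|):", violations_sqrt)
```

A random search like this can only find counterexamples; finding none for the square-root variant is consistent with a proof but is not a substitute for one.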

Conclusion: [Formula: see text], as a variant of the absolute correlation distance, satisfies the triangle inequality and is capable of more robust clustering.

Citing Articles

Hasegan D, Geniesse C, Chowdhury S, Saggar M. Deconstructing the Mapper algorithm to extract richer topological and temporal features from functional neuroimaging data. Netw Neurosci. 2024; 8(4):1355-1382. PMID: 39735492. PMC: 11675014. DOI: 10.1162/netn_a_00403.
