Dissecting Gene Expression Heterogeneity: Generalized Pearson Correlation Squares and the K-lines Clustering Algorithm

Overview

Journal J Am Stat Assoc

Specialty Public Health

Date 2024 Dec 19

PMID 39697782

Authors

Jingyi Jessica Li

Heather J Zhou

Peter J Bickel

Xin Tong


Abstract

Motivated by the pressing needs for dissecting heterogeneous relationships in gene expression data, here we generalize the squared Pearson correlation to capture a mixture of linear dependences between two real-valued variables, with or without an index variable that specifies the line memberships. We construct the generalized Pearson correlation squares by focusing on three aspects: variable exchangeability, no parametric model assumptions, and inference of population-level parameters. To compute the generalized Pearson correlation square from a sample without a line-membership specification, we develop a K-lines clustering algorithm to find clusters that exhibit distinct linear dependences, where K can be chosen in a data-adaptive way. To infer the population-level generalized Pearson correlation squares, we derive the asymptotic distributions of the sample-level statistics to enable efficient statistical inference. Simulation studies verify the theoretical results and show the power advantage of the generalized Pearson correlation squares in capturing mixtures of linear dependences. Gene expression data analyses demonstrate the effectiveness of the generalized Pearson correlation squares and the K-lines clustering algorithm in dissecting complex but interpretable relationships. The estimation and inference procedures are implemented in the R package gR2 (https://github.com/lijy03/gR2).
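
The abstract does not spell out the steps of the clustering algorithm or the exact form of the statistic, so the following is only a minimal Python sketch under stated assumptions, not the gR2 implementation. It assumes (1) a K-means-style alternation in which each point is assigned to its nearest line by perpendicular distance and each line is refit by total least squares, which treats the two variables symmetrically, and (2) a summary statistic formed as the cluster-proportion-weighted sum of squared within-cluster Pearson correlations. The function names `fit_line`, `perp_dist`, `k_lines`, and `weighted_r2` are hypothetical and chosen for illustration.

```python
import numpy as np


def fit_line(points):
    """Total least squares fit: return (center, unit direction) of the line."""
    center = points.mean(axis=0)
    # The first right-singular vector is the direction of largest variance.
    _, _, vt = np.linalg.svd(points - center, full_matrices=False)
    return center, vt[0]


def perp_dist(points, center, direction):
    """Perpendicular distance from each 2-D point to the given line."""
    diff = points - center
    # Magnitude of the 2-D cross product with a unit direction vector.
    return np.abs(diff[:, 0] * direction[1] - diff[:, 1] * direction[0])


def k_lines(x, y, k, n_iter=100, seed=0):
    """Assumed K-means-style alternation between line fitting and assignment."""
    rng = np.random.default_rng(seed)
    pts = np.column_stack([x, y]).astype(float)
    labels = rng.integers(k, size=len(pts))  # random initial memberships
    for _ in range(n_iter):
        lines = []
        for j in range(k):
            members = pts[labels == j]
            if len(members) < 2:
                # Re-seed a nearly empty cluster with two random points.
                members = pts[rng.choice(len(pts), size=2, replace=False)]
            lines.append(fit_line(members))
        dists = np.column_stack([perp_dist(pts, c, d) for c, d in lines])
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels
    return labels


def weighted_r2(x, y, labels):
    """Assumed statistic: sum over clusters of proportion * squared Pearson r."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for j in np.unique(labels):
        mask = labels == j
        # Skip clusters where the correlation is undefined.
        if mask.sum() < 3 or np.std(x[mask]) == 0 or np.std(y[mask]) == 0:
            continue
        r = np.corrcoef(x[mask], y[mask])[0, 1]
        total += mask.mean() * r ** 2
    return total
```

As a usage illustration, consider points lying on the two lines y = x and y = -x in equal numbers: the ordinary Pearson correlation of the pooled sample is essentially zero, whereas `weighted_r2` evaluated with the true line memberships equals 1. Because the alternation starts from random memberships, `k_lines` may converge to a local optimum, so running it from several seeds is a sensible precaution in this sketch.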

Citing Articles

Categorization of 34 computational methods to detect spatially variable genes from spatially resolved transcriptomics data.

Yan G, Hua S, Li J Nat Commun. 2025; 16(1):1141.

PMID: 39880807 PMC: 11779979. DOI: 10.1038/s41467-025-56080-w.

Categorization of 33 computational methods to detect spatially variable genes from spatially resolved transcriptomics data.

Yan G, Hua S, Li J ArXiv. 2024; .

PMID: 38855546 PMC: 11160866.
