Causal Discovery in High-dimensional, Multicollinear Datasets

Overview
Journal Front Epidemiol
Specialty Public Health
Date 2023 Feb 13
PMID 36778756
Abstract

As the cost of high-throughput genomic sequencing technology declines, its application in clinical research has become increasingly common. The resulting datasets often contain tens or hundreds of thousands of biological features that must be mined to extract meaningful information. One area of particular interest is discovering the causal mechanisms underlying disease outcomes. Over the past few decades, causal discovery algorithms have been developed and extended to infer such relationships. However, these algorithms suffer from the curse of dimensionality and from multicollinearity. A recently introduced, non-orthogonal, general empirical Bayes approach to matrix factorization has been shown to infer latent factors with interpretable structure from observed variables. We hypothesize that applying this strategy to causal discovery algorithms can solve both the high-dimensionality and collinearity problems inherent to most biomedical datasets. We evaluate this strategy on simulated data and apply it to two real-world datasets. In a breast cancer dataset, we identified important survival-associated latent factors, as well as biologically meaningful enriched pathways within factors related to important clinical features. In a SARS-CoV-2 dataset, we were able to predict whether a patient (1) had COVID-19 and (2) would enter the ICU. Furthermore, we were able to associate factors with known COVID-19-related biological pathways.
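To make the motivation concrete, the following is a minimal sketch of the general idea, not the authors' implementation. It simulates a wide dataset whose features are driven by a handful of latent factors, then compresses it to a few factor scores before any downstream causal analysis. As a loudly stated assumption, a plain truncated SVD (which is orthogonal) stands in for the non-orthogonal empirical Bayes matrix factorization named in the abstract. All function names, sizes, and noise levels are hypothetical choices made for illustration.

```python
import numpy as np


def simulate(n_samples=200, n_features=500, n_factors=5, noise=0.1, seed=0):
    """Simulate wide, collinear data: many features driven by few latent factors."""
    rng = np.random.default_rng(seed)
    factors = rng.normal(size=(n_samples, n_factors))
    loadings = rng.normal(size=(n_factors, n_features))
    X = factors @ loadings + noise * rng.normal(size=(n_samples, n_features))
    return X


def reduce_to_factors(X, k):
    """Stand-in factorization (assumption): truncated SVD of the centered data.

    Returns an (n_samples, k) matrix of factor scores.
    """
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k] * s[:k]


def max_abs_offdiag_corr(M):
    """Largest absolute pairwise correlation between distinct columns of M."""
    C = np.corrcoef(M, rowvar=False)
    np.fill_diagonal(C, 0.0)
    return float(np.abs(C).max())


X = simulate()
Z = reduce_to_factors(X, k=5)

print("features:", X.shape[1], "-> factors:", Z.shape[1])
print("max |corr| among original features:", max_abs_offdiag_corr(X))
print("max |corr| among factor scores:   ", max_abs_offdiag_corr(Z))
```

In this sketch a causal discovery algorithm would then be run on the columns of `Z` rather than on the columns of `X`, so it sees a handful of weakly correlated variables instead of hundreds of strongly correlated ones. The uncorrelated scores here are a by-product of the SVD stand-in; a non-orthogonal factorization would not guarantee them.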

Citing Articles

Streamlining NMR Chemical Shift Predictions for Intrinsically Disordered Proteins: Design of Ensembles with Dimensionality Reduction and Clustering.

Bakker M, Gaffour A, Juhas M, Zapletal V, Stosek J, Bratholm L. J Chem Inf Model. 2024; 64(16):6542-6556.

PMID: 39099394 PMC: 11412307. DOI: 10.1021/acs.jcim.4c00809.
