
Compressing Gene Expression Data Using Multiple Latent Space Dimensionalities Learns Complementary Biological Representations

Overview
Journal: Genome Biol
Specialties: Biology, Genetics
Date: 2020 May 13
PMID: 32393369
Citations: 33
Abstract

Background: Unsupervised compression algorithms applied to gene expression data extract latent or hidden signals representing technical and biological sources of variation. However, these algorithms require a user to select a biologically appropriate latent space dimensionality. In practice, most researchers fit a single algorithm at a single latent dimensionality. We sought to determine the extent to which selecting only one fit limits the biological features captured in the latent representations and, consequently, what can be discovered in subsequent analyses.

Results: We compress gene expression data from three large datasets consisting of adult normal tissue, adult cancer tissue, and pediatric cancer tissue. We train many different models across a large range of latent space dimensionalities and observe various performance differences. We identify more curated pathway gene sets significantly associated with individual dimensions in denoising autoencoder and variational autoencoder models trained with an intermediate number of latent dimensions. Combining compressed features across algorithms and dimensionalities captures the most pathway-associated representations. When trained with different latent dimensionalities, models learn strongly associated and generalizable biological representations, including sex, neuroblastoma MYCN amplification, and cell types. Stronger signals, such as tumor type, are best captured in models trained at lower dimensionalities, while more subtle signals, such as pathway activity, are best identified in models trained with more latent dimensions.

Conclusions: There is no single best latent dimensionality or compression algorithm for analyzing gene expression data. Instead, using features derived from different compression models across multiple latent space dimensionalities enhances biological representations.
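The idea of combining features across algorithms and latent dimensionalities can be illustrated with a minimal sketch. It is not the authors' pipeline: PCA and NMF from scikit-learn stand in for the compression models (the abstract names only denoising and variational autoencoders), the dimensionalities `(2, 4, 8)` are arbitrary, the function name `compress_across_dimensionalities` is hypothetical, and the expression matrix is random placeholder data rather than real gene expression.

```python
import numpy as np
from sklearn.decomposition import PCA, NMF


def compress_across_dimensionalities(expression, dims=(2, 4, 8), seed=0):
    """Fit each stand-in model at each latent dimensionality and
    concatenate all compressed features column-wise.

    expression: samples x genes matrix (non-negative, as NMF requires).
    Returns the stacked feature matrix and one label per feature.
    """
    blocks, labels = [], []
    for k in dims:
        # Stand-in algorithms (an assumption, not the models in the abstract).
        models = {
            "pca": PCA(n_components=k, random_state=seed),
            "nmf": NMF(n_components=k, init="nndsvda",
                       max_iter=500, random_state=seed),
        }
        for name, model in models.items():
            z = model.fit_transform(expression)  # samples x k
            blocks.append(z)
            labels.extend(f"{name}_k{k}_{i}" for i in range(k))
    return np.hstack(blocks), labels


# Placeholder data: 60 samples x 30 genes, values in [0, 1).
rng = np.random.default_rng(0)
expression = rng.random((60, 30))

features, labels = compress_across_dimensionalities(expression)
print(features.shape)
```

Each column of `features` keeps a label recording the algorithm and dimensionality it came from, so a downstream association test (for example against pathway gene sets or sample covariates) can be traced back to the model that produced the signal.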

Citing Articles

BuDDI: Bulk Deconvolution with Domain Invariance to predict cell-type-specific perturbations from bulk.

Davidson N, Zhang F, Greene C. PLoS Comput Biol. 2025; 21(1):e1012742.

PMID: 39823522 PMC: 11790236. DOI: 10.1371/journal.pcbi.1012742.


Diffusion-based generation of gene regulatory networks from scRNA-seq data with DigNet.

Wang C, Liu Z, Liu Z. Genome Res. 2024; 35(2):340-354.

PMID: 39694856 PMC: 11874984. DOI: 10.1101/gr.279551.124.


Deep profiling of gene expression across 18 human cancers.

Qiu W, Dincer A, Janizek J, Celik S, Pittet M, Naxerova K. Nat Biomed Eng. 2024; .

PMID: 39690287 DOI: 10.1038/s41551-024-01290-8.


Latent space arithmetic on data embeddings from healthy multi-tissue human RNA-seq decodes disease modules.

de Weerd H, Guala D, Gustafsson M, Synnergren J, Tegner J, Lubovac-Pilav Z. Patterns (N Y). 2024; 5(11):101093.

PMID: 39568475 PMC: 11573900. DOI: 10.1016/j.patter.2024.101093.


iModulonMiner and PyModulon: Software for unsupervised mining of gene expression compendia.

Sastry A, Yuan Y, Poudel S, Rychel K, Yoo R, Lamoureux C. PLoS Comput Biol. 2024; 20(10):e1012546.

PMID: 39441835 PMC: 11534266. DOI: 10.1371/journal.pcbi.1012546.

