Clusternomics: Integrative Context-dependent Clustering for Heterogeneous Datasets

Overview

Journal PLoS Comput Biol

Specialty Biology

Date 2017 Oct 17

PMID 29036190

Citations 27

Authors

Evelina Gabasova

John Reid

Lorenz Wernisch

Abstract

Integrative clustering is used to identify groups of samples by jointly analysing multiple datasets describing the same set of biological samples, such as gene expression, copy number and methylation data. Most existing algorithms for integrative clustering assume that a consistent set of clusters is shared across all datasets and that most of the data samples follow this structure. In practice, however, the structure across heterogeneous datasets can be more varied, with clusters being joined in some datasets and separated in others. In this paper, we present a probabilistic clustering method to identify groups across datasets that do not share the same cluster structure. The proposed algorithm, Clusternomics, identifies groups of samples that share their global behaviour across heterogeneous datasets. The algorithm models clusters at the level of individual datasets, while also extracting the global structure that arises from the local cluster assignments. Clusters at both the local and the global level are modelled using a hierarchical Dirichlet mixture model. We evaluated the model on both simulated and real-world datasets. The simulated datasets have varying degrees of common structure. In this setting, Clusternomics outperforms existing algorithms for integrative and consensus clustering. In a real-world application, we used the algorithm for cancer subtyping, identifying subtypes of cancer from heterogeneous datasets. We applied the algorithm to the TCGA breast cancer dataset, integrating gene expression, miRNA expression, DNA methylation and proteomics data. The algorithm extracted clinically meaningful clusters with significantly different survival probabilities. We also evaluated the algorithm on high-dimensional TCGA lung and kidney cancer datasets, again showing clinically significant results and demonstrating the scalability of the algorithm.
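The relationship between local and global structure described in the abstract can be illustrated with a small sketch. This is not the authors' inference procedure: Clusternomics infers the assignments probabilistically, whereas the code below assumes the local cluster labels are already known and only shows how a global cluster can be read off as a combination of local assignments. The function name `global_clusters`, the dataset names and the toy labels are all hypothetical.

```python
from collections import Counter
from itertools import product


def global_clusters(local_labels):
    """Return, for each sample, the tuple of its local cluster labels.

    local_labels: dict mapping a dataset name to a list of local cluster
    labels, one per sample, with samples in the same order in every list.
    Datasets are ordered alphabetically by name inside each tuple.
    """
    names = sorted(local_labels)
    n = len(local_labels[names[0]])
    if any(len(local_labels[name]) != n for name in names):
        raise ValueError("every dataset must label the same samples")
    return [tuple(local_labels[name][i] for name in names) for i in range(n)]


# Toy, made-up labels for six samples measured in two datasets.
# The two datasets disagree on sample 2, so they do not share one
# consistent cluster structure.
local = {
    "expression": [0, 0, 0, 1, 1, 1],
    "methylation": [0, 0, 1, 1, 1, 1],
}

combos = global_clusters(local)

# Global clusters that actually contain samples, with their sizes.
occupied = Counter(combos)

# All combinations of local labels that could in principle occur.
possible = list(product(*(sorted(set(local[name])) for name in sorted(local))))

for combo in possible:
    print(combo, occupied.get(combo, 0))
```

In this toy case only three of the four possible combinations are occupied. A method that assumes one shared cluster structure would have to force sample 2 into one of two clusters, whereas treating the combination of local labels as the global cluster keeps it as a separate group.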

Citing Articles

Integrating Multi-Omics with environmental data for precision health: A novel analytic framework and case study on prenatal mercury induced childhood fatty liver disease.

Goodrich J, Wang H, Jia Q, Stratakis N, Zhao Y, Maitre L. Environ Int. 2024; 190:108930.

PMID: 39128376 PMC: 11620538. DOI: 10.1016/j.envint.2024.108930.

Bayesian Multi-View Clustering given complex inter-view structure.

Shapiro B, Battle A. F1000Res. 2024; 11:1460.

PMID: 38495778 PMC: 10940850. DOI: 10.12688/f1000research.126215.2.

intCC: An efficient weighted integrative consensus clustering of multimodal data.

Huang C, Kuan P. Pac Symp Biocomput. 2023; 29:627-640.

PMID: 38160311 PMC: 10764072.

A Drug Repurposing Pipeline Based on Bladder Cancer Integrated Proteotranscriptomics Signatures.

Mokou M, Narayanasamy S, Stroggilos R, Balaur I, Vlahou A, Mischak H. Methods Mol Biol. 2023; 2684:59-99.

PMID: 37410228 DOI: 10.1007/978-1-0716-3291-8_4.

Bayesian cluster analysis.

Wade S. Philos Trans A Math Phys Eng Sci. 2023; 381(2247):20220149.

PMID: 36970819 PMC: 10041359. DOI: 10.1098/rsta.2022.0149.
