
A Clustering Procedure for Three-way RNA Sequencing Data Using Data Transformations and Matrix-variate Gaussian Mixture Models

Overview
Publisher Biomed Central
Specialty Biology
Date 2024 Mar 1
PMID 38429687
Abstract

RNA sequencing of time-course experiments results in three-way count data where the dimensions are the genes, the time points and the biological units. Clustering RNA-seq data makes it possible to extract groups of genes that are co-expressed over time. After standardisation, the normalised counts of individual genes across time points and biological units have properties similar to those of compositional data. We propose the following procedure to suitably cluster three-way RNA-seq data: (1) pre-process the RNA-seq data by calculating the normalised expression profiles, (2) transform the data using the additive log ratio transform to map the composition in the D-part Aitchison simplex to a (D-1)-dimensional Euclidean vector, (3) cluster the transformed RNA-seq data using matrix-variate Gaussian mixture models and (4) assess the quality of the overall cluster solution and of individual clusters, based on cluster separation in the transformed space using density-based silhouette information and on cluster compactness in the original space using cluster maps as a suitable visualisation. The proposed procedure is illustrated on RNA-seq data from fission yeast, and the results are compared with those of an analogous two-way approach obtained after flattening out the biological units.
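
Steps (1) and (2) of the procedure can be illustrated with a minimal sketch. The abstract does not specify the implementation, so the following are assumptions made only for illustration: the counts are stored as an array of shape (genes, time points, biological units), each gene's profile is closed to sum to one over the time points within each unit, the last time point serves as the reference part of the additive log ratio, and a small `pseudo_count` (a hypothetical parameter) guards against zero counts. The function name `alr_transform` is likewise hypothetical.

```python
import numpy as np

def alr_transform(counts, pseudo_count=0.5):
    """Additive log ratio (ALR) transform of three-way count data.

    counts: array of shape (genes, time_points, units).
    Returns an array of shape (genes, time_points - 1, units).
    Assumption: compositions are formed along the time axis and the
    last time point is used as the reference part.
    """
    x = np.asarray(counts, dtype=float) + pseudo_count
    # Step (1): normalised expression profiles (each profile sums to 1 over time).
    profiles = x / x.sum(axis=1, keepdims=True)
    # Step (2): log ratio of each part to the reference part.
    log_p = np.log(profiles)
    return log_p[:, :-1, :] - log_p[:, -1:, :]

# Toy data: 4 genes, D = 5 time points, 3 biological units.
rng = np.random.default_rng(0)
counts = rng.poisson(lam=50, size=(4, 5, 3))
z = alr_transform(counts)
print(z.shape)  # (4, 4, 3): D = 5 parts map to D - 1 = 4 coordinates
```

Each gene is then represented by a (D-1) x (number of units) real-valued matrix, which is the kind of observation a matrix-variate Gaussian mixture model in step (3) takes as input.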
