Cluster Analysis of Replicated Alternative Polyadenylation Data Using Canonical Correlation Analysis

Overview

Journal BMC Genomics

Publisher BioMed Central

Specialty Genetics

Date 2019 Jan 24

PMID 30669970

Authors

Wenbin Ye

Yuqi Long

Guoli Ji

Yaru Su

Pengchao Ye

Hongjuan Fu

Xiaohui Wu


Abstract

Background: Alternative polyadenylation (APA) has emerged as a pervasive mechanism that contributes to transcriptome complexity and the dynamics of gene regulation. The current tsunami of whole-genome poly(A) site data generated by 3' end sequencing under various conditions provides a valuable resource for studying APA-related gene expression. Cluster analysis is a powerful technique for investigating the association structure among genes. However, conventional gene clustering methods are not suitable for APA-related data because they neither consider the poly(A) site information within each gene (e.g., location, abundance, and number of sites) nor measure the association among poly(A) sites between two genes.

Results: Here we propose a computational framework, named PASCCA, for clustering genes from replicated or unreplicated poly(A) site data using canonical correlation analysis (CCA). PASCCA incorporates multiple layers of gene expression data from both the poly(A) site level and the gene level, and takes into account the number of replicates and the variability within each experimental group. Moreover, PASCCA characterizes poly(A) sites in several ways, including abundance and relative usage, which exploits the strength of 3' end deep sequencing in quantifying APA sites. In cluster analyses of both real and synthetic poly(A) site data sets, PASCCA outperforms other widely used distance measures on five performance metrics: connectivity, the Dunn index, average distance, average distance between means, and the biological homogeneity index. We also used PASCCA to infer APA-specific gene modules from recently published poly(A) site data of rice and discovered several distinct functional gene modules. We have made PASCCA available as an easy-to-use R package for APA-related gene expression analyses, including the characterization of poly(A) sites, quantification of association between genes, and clustering of genes.
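To make the idea of a CCA-based association between two genes concrete, the following is a minimal sketch in Python with NumPy. It is not the PASCCA R package or its API, and it does not model replicates or within-group variability as the framework does. It assumes each gene is represented by a samples-by-sites matrix of poly(A) site read counts; the function names `relative_usage`, `first_canonical_correlation`, and `cca_distance` are hypothetical and chosen for illustration only.

```python
import numpy as np


def relative_usage(counts):
    """Convert a samples x sites count matrix into per-sample relative usage.

    Each row is divided by its total; rows with zero total stay all zeros.
    """
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)


def _orthonormal_basis(centered, tol):
    """Orthonormal basis of the column space, dropping near-zero directions."""
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    if s.size == 0 or s[0] == 0:
        return u[:, :0]
    return u[:, s > tol * s[0]]


def first_canonical_correlation(x, y, tol=1e-10):
    """Largest canonical correlation between the columns of x and y.

    x and y are samples x sites matrices for two genes, with the same
    samples in the same row order. The canonical correlations are the
    singular values of Qx^T Qy, where Qx and Qy are orthonormal bases
    of the centered column spaces.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    qx = _orthonormal_basis(x - x.mean(axis=0), tol)
    qy = _orthonormal_basis(y - y.mean(axis=0), tol)
    if qx.shape[1] == 0 or qy.shape[1] == 0:
        return 0.0  # a gene with no variation carries no association signal
    s = np.linalg.svd(qx.T @ qy, compute_uv=False)
    return float(min(s[0], 1.0))  # clip tiny numerical overshoot above 1


def cca_distance(x, y):
    """Dissimilarity in [0, 1]: small when the two genes' sites co-vary."""
    return 1.0 - first_canonical_correlation(x, y)


if __name__ == "__main__":
    # Made-up counts: 6 samples; gene A has 2 poly(A) sites, gene B has 3.
    gene_a = np.array([[10, 5], [12, 4], [30, 2], [28, 3], [8, 20], [9, 22]])
    gene_b = np.array([[3, 7, 1], [4, 6, 2], [9, 1, 1], [8, 2, 2], [2, 2, 9], [1, 3, 8]])
    print(cca_distance(gene_a, gene_b))
    print(cca_distance(relative_usage(gene_a), relative_usage(gene_b)))
```

A pairwise matrix of such distances could then be passed to any standard clustering routine. Note that when the number of samples is small relative to the number of sites in the two genes, the first canonical correlation approaches 1 regardless of the biology, so a sketch like this is only informative when samples clearly outnumber sites.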

Conclusions: By better handling the noise inherent in repeated measurements and by taking into account multiple layers of poly(A) site data, PASCCA could serve as a general tool for clustering and analyzing APA-specific gene expression data. Applied to emerging 3' end sequencing data, PASCCA could help elucidate the dynamic interplay between genes and their APA sites across biological conditions.
