
Cluster Analysis of Replicated Alternative Polyadenylation Data Using Canonical Correlation Analysis

Overview
Journal BMC Genomics
Publisher BioMed Central
Specialty Genetics
Date 2019 Jan 24
PMID 30669970
Abstract

Background: Alternative polyadenylation (APA) has emerged as a pervasive mechanism that contributes to transcriptome complexity and the dynamics of gene regulation. The rapidly growing volume of genome-wide poly(A) site data generated by 3' end sequencing under various conditions provides a valuable resource for studying APA-related gene expression. Cluster analysis is a powerful technique for investigating the association structure among genes; however, conventional gene clustering methods are not suitable for APA-related data because they neither consider the poly(A) site information within each gene (e.g., location, abundance, and number) nor measure the association among poly(A) sites between two genes.

Results: Here we propose a computational framework, named PASCCA, for clustering genes from replicated or unreplicated poly(A) site data using canonical correlation analysis (CCA). PASCCA incorporates multiple layers of gene expression data at both the poly(A) site level and the gene level, and takes into account the number of replicates and the variability within each experimental group. Moreover, PASCCA characterizes poly(A) sites in several ways, including abundance and relative usage, thereby exploiting the ability of 3' end deep sequencing to quantify APA sites. In cluster analyses of both real and synthetic poly(A) site data sets, PASCCA outperforms other widely used distance measures on five performance metrics: connectivity, the Dunn index, average distance, average distance between means, and the biological homogeneity index. We also used PASCCA to infer APA-specific gene modules from recently published poly(A) site data of rice and discovered several distinct functional gene modules. PASCCA is available as an easy-to-use R package for APA-related gene expression analyses, including the characterization of poly(A) sites, the quantification of association between genes, and the clustering of genes.
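To make the idea of a CCA-based association between two genes concrete, the following is a minimal sketch in Python with NumPy. It is not the PASCCA R package or its API. The function names (`relative_usage`, `first_canonical_correlation`, `cca_distance`), the toy count matrices, and the choice of "1 minus the first canonical correlation" as the distance are all assumptions made for illustration, and the sketch does not model replicate variability, which the abstract says PASCCA accounts for. Each gene is represented as a samples-by-sites matrix, so genes with different numbers of poly(A) sites can still be compared.

```python
import numpy as np


def relative_usage(counts):
    """Convert a samples x sites count matrix into per-sample usage fractions."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Samples with zero total reads keep a usage of 0 for every site.
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)


def _orthonormal_basis(matrix, tol=1e-10):
    """Orthonormal basis of the column space of the column-centered matrix."""
    centered = matrix - matrix.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    if s.size == 0 or s[0] <= tol:
        return u[:, :0]  # no variation across samples
    # Drop near-zero directions (e.g. usage fractions always sum to 1).
    return u[:, s > tol * s[0]]


def first_canonical_correlation(x, y):
    """Largest canonical correlation between two samples x features matrices."""
    bx = _orthonormal_basis(np.asarray(x, dtype=float))
    by = _orthonormal_basis(np.asarray(y, dtype=float))
    if bx.shape[1] == 0 or by.shape[1] == 0:
        return 0.0
    # Singular values of Bx^T By are the canonical correlations.
    rho = np.linalg.svd(bx.T @ by, compute_uv=False)
    return float(np.clip(rho[0], 0.0, 1.0))


def cca_distance(x, y):
    """Assumed distance: 1 minus the first canonical correlation."""
    return 1.0 - first_canonical_correlation(x, y)


# Hypothetical toy data: 8 samples; gene A has 2 poly(A) sites, gene C has 3.
gene_a = np.array([[12, 88], [25, 70], [30, 75], [55, 40],
                   [60, 52], [90, 20], [80, 10], [40, 66]], dtype=float)
gene_c = np.array([[5, 1, 9], [2, 8, 3], [7, 7, 1], [4, 2, 6],
                   [9, 3, 3], [1, 6, 8], [3, 9, 2], [6, 4, 5]], dtype=float)
# Gene B's sites are linear combinations of gene A's sites.
gene_b = gene_a @ np.array([[2.0, 1.0], [0.0, 3.0]])

print(cca_distance(gene_a, gene_b))  # approximately 0: identical column spaces
print(cca_distance(gene_a, gene_c))  # some value between 0 and 1
print(cca_distance(relative_usage(gene_a), relative_usage(gene_c)))
```

With few samples relative to the total number of sites, canonical correlations approach 1 regardless of the data, so a distance of this kind is only informative when the number of samples comfortably exceeds the combined number of sites of the two genes. The resulting pairwise distances could then be passed to any standard clustering routine.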

Conclusions: By providing a better treatment of the noise inherent in repeated measurements and taking into account multiple layers of poly(A) site data, PASCCA could serve as a general tool for clustering and analyzing APA-specific gene expression data. Applied to emerging 3' end sequencing data, PASCCA could help elucidate the dynamic interplay between genes and their APA sites across biological conditions and thereby address complex biological phenomena.
