
Penalized Likelihood for Sparse Contingency Tables with an Application to Full-length cDNA Libraries

Overview
Publisher BioMed Central
Specialty Biology
Date 2007 Dec 13
PMID 18072965
Citations 3
Abstract

Background: The joint analysis of several categorical variables is a common task in many areas of biology, and is becoming central to systems biology investigations whose goal is to identify potentially complex interactions among the variables in a network. In statistics, interactions of arbitrary complexity are traditionally modeled with log-linear models. Extending these models to the high-dimensional and potentially sparse data that arise in computational biology is challenging. An important example, which motivates this article, is the analysis of so-called full-length cDNA libraries of alternatively spliced genes, where we investigate relationships among the presence of various exons in transcript species.

Results: We develop methods for model selection and parameter estimation in log-linear models for sparse contingency tables, in order to study the interaction of two or more factors. Maximum likelihood estimation of log-linear model coefficients may not be appropriate when some cells of the table contain zeros, so new methods are required. We propose a computationally efficient ℓ1-penalization approach that extends the Lasso algorithm to this context, and compare it with other procedures in a simulation study. We then illustrate these algorithms on contingency tables arising from full-length cDNA libraries.
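
To make the idea of ℓ1-penalized log-linear modeling concrete, here is a minimal sketch in Python. It is not the algorithm from the article: it assumes a Poisson log-linear model on a 2x2x2 table with effect-coded (+1/-1) terms, leaves the intercept unpenalized, and fits the coefficients with a generic proximal-gradient (soft-thresholding) loop with backtracking. The function names, the example counts, and the penalty value `lam=1.0` are all hypothetical choices made for illustration.

```python
import itertools
import numpy as np

def design_matrix(n_factors):
    """Effect-coded (+1/-1) design for a 2^k table: intercept, main effects, interactions."""
    cells = np.array(list(itertools.product([1, -1], repeat=n_factors)))
    columns, labels = [np.ones(len(cells))], ["intercept"]
    for order in range(1, n_factors + 1):
        for subset in itertools.combinations(range(n_factors), order):
            columns.append(np.prod(cells[:, list(subset)], axis=1))
            labels.append("*".join(f"x{j}" for j in subset))
    return np.column_stack(columns), labels

def neg_loglik(X, counts, beta):
    """Poisson negative log-likelihood, up to a constant that does not depend on beta."""
    eta = X @ beta
    return np.sum(np.exp(eta)) - counts @ eta

def penalized_objective(X, counts, beta, lam):
    """Negative log-likelihood plus an l1 penalty on all non-intercept coefficients."""
    return neg_loglik(X, counts, beta) + lam * np.sum(np.abs(beta[1:]))

def soft_threshold(z, t):
    """Proximal operator of t * |z|: shrink toward zero and clip at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def fit_l1_loglinear(X, counts, lam, n_iter=5000, tol=1e-10):
    """Proximal gradient with backtracking (an illustrative solver, assumed here)."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(counts.mean())  # start from the constant-rate model
    step = 1.0
    # Overflow in exp() during backtracking just makes a trial step fail.
    with np.errstate(over="ignore", invalid="ignore"):
        for _ in range(n_iter):
            grad = X.T @ (np.exp(X @ beta) - counts)
            f_old = neg_loglik(X, counts, beta)
            while True:
                cand = beta - step * grad
                cand[1:] = soft_threshold(cand[1:], step * lam)
                diff = cand - beta
                bound = f_old + grad @ diff + diff @ diff / (2.0 * step)
                if neg_loglik(X, counts, cand) <= bound:
                    break
                step *= 0.5  # halve the step until sufficient decrease holds
            beta = cand
            if np.max(np.abs(diff)) < tol:
                break
    return beta

# Hypothetical sparse 2x2x2 table with three empty cells.
X, labels = design_matrix(3)
counts = np.array([30.0, 0.0, 12.0, 0.0, 25.0, 3.0, 0.0, 8.0])
beta = fit_l1_loglinear(X, counts, lam=1.0)
for name, value in zip(labels, beta):
    print(f"{name:>10s} {value: .3f}")
```

In this sketch the penalty keeps every coefficient finite even though the table has empty cells, and interaction terms whose coefficients are shrunk exactly to zero drop out of the model, which is how an ℓ1 penalty combines estimation with model selection.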

Conclusion: We propose regularization methods that can successfully detect complex interaction patterns among categorical variables in a broad range of biological problems.

Citing Articles

Statistical Methods and Software for Substance Use and Dependence Genetic Research.

Lan T, Yang B, Zhang X, Wang T, Lu Q. Curr Genomics. 2020; 20(3):172-183.

PMID: 31929725 PMC: 6935956. DOI: 10.2174/1389202920666190617094930.


Bayesian modeling of temporal dependence in large sparse contingency tables.

Kunihama T, Dunson D. J Am Stat Assoc. 2014; 108(504):1324-1338.

PMID: 24482548 PMC: 3904485. DOI: 10.1080/01621459.2013.823866.


Sparse linear modeling of next-generation mRNA sequencing (RNA-Seq) data for isoform discovery and abundance estimation.

Li J, Jiang C, Brown J, Huang H, Bickel P. Proc Natl Acad Sci U S A. 2011; 108(50):19867-72.

PMID: 22135461 PMC: 3250192. DOI: 10.1073/pnas.1113972108.
