
Improving the Specificity of Exon Prediction Using Comparative Genomics

Overview
Journal BMC Genomics
Publisher BioMed Central
Specialty Genetics
Date 2008 Oct 10
PMID 18831778
Citations 1
Abstract

Background: Computational gene prediction tools routinely generate large numbers of predicted coding exons (putative exons). A common limitation of these tools is relatively low specificity, owing to the large amount of non-coding sequence.

Methods: A statistical approach is developed that substantially improves gene prediction specificity. The key idea is to exploit the evolutionary conservation of coding exons. First, using the homology between the genomes of two related species, a probability model is built for the evolutionary conservation pattern of codons across genomes. Second, a probability model for the dependency between adjacent codons/triplets is added to distinguish coding exons from random sequences. Finally, a log odds ratio is used to classify each putative exon as either a coding exon or a non-coding region.
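
The log odds classification step can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the model's form, so the sketch assumes aligned codon pairs are independent, estimates pair probabilities from counts with a pseudocount, and omits the adjacent-codon dependency model entirely. The function names (`codon_pairs`, `estimate_pair_probs`, `log_odds`, `classify`), the pseudocount, and the threshold of 0 are all hypothetical choices made for illustration.

```python
import math
from collections import defaultdict


def codon_pairs(seq_a, seq_b):
    """Split two gap-free aligned sequences into in-frame aligned codon pairs."""
    n = min(len(seq_a), len(seq_b)) // 3 * 3  # drop any trailing partial codon
    return [(seq_a[i:i + 3], seq_b[i:i + 3]) for i in range(0, n, 3)]


def estimate_pair_probs(alignments, pseudocount=1.0):
    """Return a function giving a smoothed P(codon_a, codon_b).

    `alignments` is an iterable of (seq_a, seq_b) training pairs.
    Assumption: all 64 x 64 codon pairs are possible outcomes.
    """
    counts = defaultdict(float)
    total = 0
    for seq_a, seq_b in alignments:
        for pair in codon_pairs(seq_a, seq_b):
            counts[pair] += 1
            total += 1
    denom = total + pseudocount * 64 * 64
    return lambda pair: (counts.get(pair, 0.0) + pseudocount) / denom


def log_odds(seq_a, seq_b, p_coding, p_noncoding):
    """Sum of per-codon-pair log ratios (independence is an assumption here)."""
    return sum(
        math.log(p_coding(pair) / p_noncoding(pair))
        for pair in codon_pairs(seq_a, seq_b)
    )


def classify(seq_a, seq_b, p_coding, p_noncoding, threshold=0.0):
    """Label a putative exon by comparing its log odds to a threshold."""
    score = log_odds(seq_a, seq_b, p_coding, p_noncoding)
    return "coding" if score > threshold else "non-coding"
```

In use, `p_coding` would be estimated from aligned known exons and `p_noncoding` from aligned non-coding sequence. A positive score means the aligned pair looks more like conserved coding sequence than background under these simplified models.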

Results: The method was tested on pre-aligned human-mouse sequences in which the putative exons were predicted by GENSCAN and TWINSCAN. It improves exon specificity by 73% and 32%, respectively, while the loss of sensitivity is ≤ 1%. It also retains 98% of the RefSeq gene structures correctly predicted by TWINSCAN while removing 26% of predicted genes that are in non-coding regions. The estimated number of true exons in TWINSCAN's predictions is 157,070. The results and executable code can be downloaded from http://www.stat.purdue.edu/~jingwu/codon/
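
To make the two accuracy measures concrete, the sketch below computes exon-level sensitivity and specificity under the convention common in gene prediction evaluation: sensitivity is the fraction of annotated exons that are predicted, and specificity is the fraction of predicted exons that are annotated. The abstract does not state which definitions it uses, so this convention is an assumption, as are the function name `exon_accuracy` and the exact-coordinate matching rule.

```python
def exon_accuracy(predicted, annotated):
    """Return (sensitivity, specificity) at the exon level.

    Exons are (start, end) tuples; a prediction counts as correct only if
    both coordinates match an annotated exon exactly (an assumption).
    """
    predicted, annotated = set(predicted), set(annotated)
    true_positives = len(predicted & annotated)
    sensitivity = true_positives / len(annotated) if annotated else 0.0
    specificity = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, specificity
```

Under these definitions, a filter that discards false-positive exons while keeping nearly all true ones raises specificity and leaves sensitivity almost unchanged, which is the behaviour the results describe.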

Conclusion: The proposed method demonstrates an application of the evolutionary conservation principle to coding exons. It is a complementary method that can be used as an additional criterion to refine many existing gene predictions.

Citing Articles

Genomics, molecular imaging, bioinformatics, and bio-nano-info integration are synergistic components of translational medicine and personalized healthcare research.

Yang J, Yang M, Arabnia H, Deng Y. BMC Genomics. 2008; 9 Suppl 2:I1.

PMID: 18831773 PMC: 3226104. DOI: 10.1186/1471-2164-9-S2-I1.
