Gene Recognition Via Spliced Sequence Alignment

Overview

Journal: Proc Natl Acad Sci U S A

Specialty: Science

Date: 1996 Aug 20

PMID: 8799154

Citations: 59

Authors

M S Gelfand

A A Mironov

P A Pevzner

Abstract

Gene recognition is one of the most important problems in computational molecular biology. Previous attempts to solve this problem were based on statistics, and applications of combinatorial methods for gene recognition were almost unexplored. Recent advances in large-scale cDNA sequencing open a way toward a new approach to gene recognition that uses previously sequenced genes as a clue for recognition of newly sequenced genes. This paper describes a spliced alignment algorithm and software tool that explores all possible exon assemblies in polynomial time and finds the multiexon structure with the best fit to a related protein. Unlike other existing methods, the algorithm successfully recognizes genes even in the case of short exons or exons with unusual codon usage; we also report correct assemblies for genes with more than 10 exons. On a test sample of human genes with known mammalian relatives, the average correlation between the predicted and actual proteins was 99%. The algorithm correctly reconstructed 87% of genes and the rare discrepancies between the predicted and real exon-intron structures were caused either by short (less than 5 amino acids) initial/terminal exons or by alternative splicing. Moreover, the algorithm predicts human genes reasonably well when the homologous protein is nonvertebrate or even prokaryotic. The surprisingly good performance of the method was confirmed by extensive simulations: in particular, with target proteins at 160 accepted point mutations (PAM) (25% similarity), the correlation between the predicted and actual genes was still as high as 95%.
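The core idea in the abstract, scoring every chain of candidate exons against a related sequence without enumerating the chains one by one, can be illustrated with a small dynamic program. The sketch below is a simplified illustration, not the authors' implementation. It assumes a nucleotide target rather than a protein, a precomputed list of candidate blocks given as half-open intervals, and simple match/mismatch/gap scores. The name `spliced_alignment` and all parameters are hypothetical. Each block starts from the best score of any chain that ends before it, so the running time is polynomial in the number of blocks and the sequence lengths.

```python
from typing import List, Tuple

Block = Tuple[int, int]  # half-open [start, end) interval in the genomic sequence


def spliced_alignment(genome: str, blocks: List[Block], target: str,
                      match: int = 1, mismatch: int = -1,
                      gap: int = -1) -> Tuple[int, Tuple[int, ...]]:
    """Return (score, chain) for the best chain of non-overlapping blocks.

    The concatenation of the chosen blocks is globally aligned to `target`.
    `chain` holds indices into `blocks`. Blocks are assumed to be non-empty.
    """
    m = len(target)
    # Process blocks by end position so every possible predecessor is done first.
    order = sorted(range(len(blocks)), key=lambda k: blocks[k][1])
    # Score of aligning an empty chain against each prefix of the target.
    empty = [(j * gap, ()) for j in range(m + 1)]
    last_rows = {}  # block index -> DP row after the block's final position
    best = empty[m]

    for k in order:
        start, end = blocks[k]
        # Entry row: best chain (possibly empty) that ends before this block starts.
        row = list(empty)
        for p, prev_row in last_rows.items():
            if blocks[p][1] <= start:
                row = [max(a, b) for a, b in zip(row, prev_row)]
        # Extend every candidate chain with the current block.
        row = [(score, chain + (k,)) for score, chain in row]
        # Standard global-alignment recurrence over the block's positions.
        for i in range(start, end):
            new = [(row[0][0] + gap, row[0][1])]
            for j in range(1, m + 1):
                sub = match if genome[i] == target[j - 1] else mismatch
                new.append(max(
                    (row[j - 1][0] + sub, row[j - 1][1]),  # align genome[i] with target[j-1]
                    (row[j][0] + gap, row[j][1]),          # genome[i] against a gap
                    (new[j - 1][0] + gap, new[j - 1][1]),  # target[j-1] against a gap
                ))
            row = new
        last_rows[k] = row
        best = max(best, row[m])
    return best


# Toy example: two true exons, one decoy inside the intron, one overlapping decoy.
toy_genome = "ATGGCGTAAGTTTCAGTTGA"
toy_blocks = [(0, 5), (7, 11), (16, 20), (3, 9)]
toy_score, toy_chain = spliced_alignment(toy_genome, toy_blocks, "ATGGCTTGA")
print(toy_score, toy_chain)  # prints: 9 (0, 2)
```

In the toy example the chain of blocks 0 and 2 spells the target exactly, so it wins over chains that include either decoy. Carrying the chain inside each cell keeps the sketch short at the cost of memory; a production implementation would more likely store back-pointers and recover the chain by traceback.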

Citing Articles

ORFograph: search for novel insecticidal protein genes in genomic and metagenomic assembly graphs.

Dvorkina T, Bankevich A, Sorokin A, Yang F, Adu-Oppong B, Williams R Microbiome. 2021; 9(1):149.

PMID: 34183047 PMC: 8240309. DOI: 10.1186/s40168-021-01092-z.

BRAKER2: automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database.

Bruna T, Hoff K, Lomsadze A, Stanke M, Borodovsky M NAR Genom Bioinform. 2021; 3(1):lqaa108.

PMID: 33575650 PMC: 7787252. DOI: 10.1093/nargab/lqaa108.

Cooperation of Spaln and Prrn5 for Construction of Gene-Structure-Aware Multiple Sequence Alignment.

Gotoh O Methods Mol Biol. 2020; 2231:71-88.

PMID: 33289887 DOI: 10.1007/978-1-0716-1036-7_5.

MetaEuk-sensitive, high-throughput gene discovery, and annotation for large-scale eukaryotic metagenomics.

Levy Karin E, Mirdita M, Soding J Microbiome. 2020; 8(1):48.

PMID: 32245390 PMC: 7126354. DOI: 10.1186/s40168-020-00808-x.

Whole-Genome Alignment and Comparative Annotation.

Armstrong J, Fiddes I, Diekhans M, Paten B Annu Rev Anim Biosci. 2018; 7:41-64.

PMID: 30379572 PMC: 6450745. DOI: 10.1146/annurev-animal-020518-115005.
