Evaluating High-throughput Ab Initio Gene Finders to Discover Proteins Encoded in Eukaryotic Pathogen Genomes Missed by Laboratory Techniques

Overview

Journal PLoS One

Specialties General Medicine, Science

Date 2012 Dec 11

PMID 23226328

Citations 15

Authors

Stephen J Goodswen

Paul J Kennedy

John T Ellis

Abstract

Next generation sequencing technology is advancing genome sequencing at an unprecedented rate. By unravelling the code within a pathogen's genome, every possible protein (prior to post-translational modifications) can theoretically be discovered, irrespective of life cycle stages and environmental stimuli. Now more than ever there is a great need for high-throughput ab initio gene finding. Ab initio gene finders use statistical models to predict genes and their exon-intron structures from the genome sequence alone. This paper evaluates whether existing ab initio gene finders can effectively predict genes and thereby deduce proteins that laboratory techniques have so far failed to capture. One aim is to identify patterns of prediction inaccuracy that hold for gene finders as a whole, irrespective of the target pathogen. All currently available ab initio gene finders are considered in the evaluation, but only four fulfil high-throughput capability: AUGUSTUS, GeneMark_hmm, GlimmerHMM, and SNAP. These gene finders require training data specific to a target pathogen, and consequently the evaluation results are inextricably linked to the availability and quality of those data. The pathogen Toxoplasma gondii is used to illustrate the evaluation methods. The results support the current opinion that exons predicted by ab initio gene finders are inaccurate in the absence of experimental evidence. However, the results also reveal some patterns of inaccuracy that are common to all four gene finders, and these inaccuracies may provide a focus area for future gene finder developers.
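The abstract does not spell out how prediction accuracy is scored, so the following is only a minimal sketch of one conventional way to compare predicted exons with a reference annotation: exon-level sensitivity and specificity based on exact boundary matches. The function name `exon_level_accuracy`, the tuple layout, and the coordinates are hypothetical and are not taken from the paper.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical exon record: (sequence id, start, end, strand), 1-based inclusive.
Exon = Tuple[str, int, int, str]


def overlaps(a: Exon, b: Exon) -> bool:
    """True if two exons lie on the same sequence and strand and share a base."""
    return a[0] == b[0] and a[3] == b[3] and a[1] <= b[2] and b[1] <= a[2]


def exon_level_accuracy(predicted: Iterable[Exon],
                        reference: Iterable[Exon]) -> Dict[str, float]:
    """Score predicted exons against reference exons.

    An exon counts as correct only if both boundaries match exactly.
    Sensitivity = correct / reference exons.
    Specificity = correct / predicted exons (the gene-finding convention,
    which corresponds to precision elsewhere).
    """
    pred = set(predicted)
    ref = set(reference)
    exact = pred & ref

    # Reference exons that no prediction touches at all.
    missed = sum(1 for r in ref if not any(overlaps(r, p) for p in pred))
    # Predicted exons that touch no reference exon at all.
    wrong = sum(1 for p in pred if not any(overlaps(p, r) for r in ref))

    return {
        "exact": len(exact),
        "sensitivity": len(exact) / len(ref) if ref else 0.0,
        "specificity": len(exact) / len(pred) if pred else 0.0,
        "missed": missed,
        "wrong": wrong,
    }


# Made-up coordinates, for illustration only.
reference_exons = [("chr1", 100, 200, "+"), ("chr1", 300, 400, "+"), ("chr1", 500, 600, "+")]
predicted_exons = [("chr1", 100, 200, "+"), ("chr1", 310, 400, "+"), ("chr1", 800, 900, "+")]

print(exon_level_accuracy(predicted_exons, reference_exons))
```

In this toy input, one predicted exon matches exactly, one overlaps a reference exon but has a shifted start, and one falls in a region with no reference exon. Keeping the "missed" and "wrong" counts separate from the exact matches is one way to look for the kind of shared inaccuracy patterns the abstract mentions.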

Citing Articles

GeneMark-ETP significantly improves the accuracy of automatic annotation of large eukaryotic genomes.

Bruna T, Lomsadze A, Borodovsky M Genome Res. 2024; 34(5):757-768.

PMID: 38866548 PMC: 11216313. DOI: 10.1101/gr.278373.123.

Comparative Genome Annotation.

Nachtweide S, Romoth L, Stanke M Methods Mol Biol. 2024; 2802:165-187.

PMID: 38819560 DOI: 10.1007/978-1-0716-3838-5_7.

A new gene finding tool GeneMark-ETP significantly improves the accuracy of automatic annotation of large eukaryotic genomes.

Bruna T, Lomsadze A, Borodovsky M bioRxiv. 2023.

PMID: 36711453 PMC: 9882169. DOI: 10.1101/2023.01.13.524024.

Understanding the causes of errors in eukaryotic protein-coding gene prediction: a case study of primate proteomes.

Meyer C, Scalzitti N, Jeannin-Girardon A, Collet P, Poch O, Thompson J BMC Bioinformatics. 2020; 21(1):513.

PMID: 33172385 PMC: 7656754. DOI: 10.1186/s12859-020-03855-1.

Using AnABlast for intergenic sORF prediction in the Caenorhabditis elegans genome.

Casimiro-Soriguer C, Rigual M, Brokate-Llanos A, Munoz M, Garzon A, Perez-Pulido A Bioinformatics. 2020; 36(19):4827-4832.

PMID: 32614398 PMC: 7723330. DOI: 10.1093/bioinformatics/btaa608.
