IdentiCS--identification of Coding Sequence and in Silico Reconstruction of the Metabolic Network Directly from Unannotated Low-coverage Bacterial Genome Sequence

Overview

Journal BMC Bioinformatics

Publisher BioMed Central

Specialty Biology

Date 2004 Aug 18

PMID 15312235

Citations 10

Authors

Jibin Sun

An-Ping Zeng

Abstract

Background: A necessary step for a genome-level analysis of cellular metabolism is the in silico reconstruction of the metabolic network from genome sequences. The available methods are mainly based on the annotation of genome sequences, which comprises two successive steps: the prediction of coding sequences (CDSs) and the assignment of their functions. The annotation process is time-consuming, and the available methods often encounter difficulties when dealing with unfinished, error-containing genomic sequence.

Results: In this work, a fast method is proposed that uses unannotated genome sequence to predict CDSs and to reconstruct metabolic networks in silico. Instead of using predicted genes or CDSs to query public databases, entries from public DNA or protein databases are used as queries to search a local database of the unannotated genome sequence to predict CDSs. Functions are assigned to the predicted CDSs simultaneously. The well-annotated genome of Salmonella typhimurium LT2 is used as an example to demonstrate the applicability of the method: 97.7% of the CDSs in the original annotation are correctly identified. The use of the SWISS-PROT-TrEMBL databases resulted in the identification of 98.9% of the CDSs that have EC numbers in the published annotation. Furthermore, two versions of the sequence of the bacterium Klebsiella pneumoniae with different genome coverage (3.9-fold and 7.9-fold, respectively) are examined. The results suggest that a 3.9-fold coverage of the bacterial genome could be sufficient for the in silico reconstruction of the metabolic network. Compared to other gene-finding methods such as CRITICA, our method is more suitable for exploiting sequences of low genome coverage. Based on the new method, a program called IdentiCS (Identification of Coding Sequences from Unfinished Genome Sequences) is provided that combines the identification of CDSs with the reconstruction, comparison and visualization of metabolic networks (free to download at http://genome.gbf.de/bioinformatics/index.html).
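The reversed querying idea described above can be illustrated with a minimal Python sketch. It is not the IdentiCS implementation: the `Hit` record, the `min_score` threshold, the overlap-merging rule and the function names are all hypothetical, and the similarity search itself is assumed to have been run already. The sketch only shows how hits of database entries against the genome could be grouped into predicted CDSs, each of which inherits the function (here, an EC number) of its best-scoring query.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One alignment of a database entry (the query) against the genome.

    All fields are illustrative assumptions, not the IdentiCS data model.
    """
    query_id: str    # database entry used as the query
    ec_number: str   # EC number annotated on the query, "" if none
    contig: str      # contig of the unannotated genome that was hit
    start: int       # 1-based inclusive, start <= end
    end: int
    strand: str      # "+" or "-"
    score: float     # similarity score of the alignment


def _summarize(contig, strand, cluster):
    # The best-scoring hit defines the CDS boundaries and its function.
    best = max(cluster, key=lambda h: h.score)
    return {
        "contig": contig,
        "strand": strand,
        "start": best.start,
        "end": best.end,
        "source": best.query_id,
        "ec_number": best.ec_number,
    }


def predict_cds(hits, min_score=50.0):
    """Merge overlapping hits per contig and strand into predicted CDSs.

    min_score is an arbitrary, hypothetical cutoff.
    """
    groups = defaultdict(list)
    for h in hits:
        if h.score >= min_score:
            groups[(h.contig, h.strand)].append(h)

    predictions = []
    for (contig, strand), group in sorted(groups.items()):
        group.sort(key=lambda h: h.start)
        cluster, cluster_end = [group[0]], group[0].end
        for h in group[1:]:
            if h.start <= cluster_end:          # overlaps the current cluster
                cluster.append(h)
                cluster_end = max(cluster_end, h.end)
            else:                               # gap: close cluster, start anew
                predictions.append(_summarize(contig, strand, cluster))
                cluster, cluster_end = [h], h.end
        predictions.append(_summarize(contig, strand, cluster))
    return predictions


def ec_numbers(predictions):
    """EC numbers of the predicted CDSs: the input for network reconstruction."""
    return {p["ec_number"] for p in predictions if p["ec_number"]}
```

Because function assignment falls out of the same search that locates the CDS, no separate annotation pass is needed in this sketch; the set returned by `ec_numbers` would be the starting point for mapping enzymes to reactions.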

Conclusions: The reversed querying process and the program IdentiCS allow a fast and adequate prediction of protein-coding sequences and reconstruction of the potential metabolic network from low-coverage genome sequences of bacteria. The new method can accelerate the use of genomic data for studying cellular metabolism.

Citing Articles

Predicting pathways for old and new metabolites through clustering.

Siddharth T, Lewis N. J Theor Biol. 2023; 578:111684.

PMID: 38048983 PMC: 11139542. DOI: 10.1016/j.jtbi.2023.111684.

The RAVEN toolbox and its use for generating a genome-scale metabolic model for Penicillium chrysogenum.

Agren R, Liu L, Shoaie S, Vongsangnak W, Nookaew I, Nielsen J. PLoS Comput Biol. 2013; 9(3):e1002980.

PMID: 23555215 PMC: 3605104. DOI: 10.1371/journal.pcbi.1002980.

Machine learning methods for metabolic pathway prediction.

Dale J, Popescu L, Karp P. BMC Bioinformatics. 2010; 11:15.

PMID: 20064214 PMC: 3146072. DOI: 10.1186/1471-2105-11-15.

Flux Design: In silico design of cell factories based on correlation of pathway fluxes to desired properties.

Melzer G, Esfandabadi M, Franco-Lara E, Wittmann C. BMC Syst Biol. 2009; 3:120.

PMID: 20035624 PMC: 2808316. DOI: 10.1186/1752-0509-3-120.

Genome-scale models of bacterial metabolism: reconstruction and applications.

Durot M, Bourguignon P, Schachter V. FEMS Microbiol Rev. 2008; 33(1):164-90.

PMID: 19067749 PMC: 2704943. DOI: 10.1111/j.1574-6976.2008.00146.x.
