Gene Prediction in Metagenomic Fragments Based on the SVM Algorithm

Overview

Journal BMC Bioinformatics

Publisher BioMed Central

Specialty Biology

Date 2013 Jun 6

PMID 23735199

Citations 33

Authors

Yongchu Liu

Jiangtao Guo

Gangqing Hu

Huaiqiu Zhu


Abstract

Background: Metagenomic sequencing is becoming a powerful technology for exploring microorganisms from various environments, such as the human body, without isolation and cultivation. Accurately identifying genes in metagenomic fragments is one of the most fundamental issues.

Results: In this article, we present MetaGUN, a novel gene prediction method for metagenomic fragments based on a support vector machine (SVM) approach. It predicts genes with a three-stage strategy. First, it classifies input fragments into phylogenetic groups using a k-mer based sequence binning method. Then, protein-coding sequences are identified for each group independently with SVM classifiers that integrate entropy density profiles (EDP) of codon usage, translation initiation site (TIS) scores and open reading frame (ORF) length as input patterns. Finally, the TISs are adjusted by a modified version of MetaTISA. To identify protein-coding sequences, MetaGUN builds a universal module and a novel module. The former is based on a set of representative species, while the latter is designed to find potentially functional DNA sequences with conserved domains.
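The second stage can be illustrated with a minimal sketch of how the three kinds of input patterns might be assembled into one feature vector per ORF. The abstract gives no formulas, so this assumes the usual entropy density profile form, s_i = -(p_i log p_i) / H with H the Shannon entropy of the codon frequencies p_i. The names `codon_edp` and `orf_features` are hypothetical, and the TIS score is taken as a precomputed input rather than derived here. It is not MetaGUN's actual implementation.

```python
import math
from collections import Counter
from itertools import product

# All 64 codons in a fixed order, so feature positions are stable.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]
CODON_SET = set(CODONS)


def codon_edp(orf):
    """Entropy density profile of codon usage for an in-frame ORF.

    Assumed form: s_i = -(p_i * log p_i) / H, where H is the Shannon
    entropy of the codon frequencies. Codons containing ambiguous bases
    are ignored.
    """
    seq = orf.upper()
    usable = len(seq) - len(seq) % 3  # drop a trailing partial codon
    counts = Counter(seq[i:i + 3] for i in range(0, usable, 3))
    total = sum(n for codon, n in counts.items() if codon in CODON_SET)
    if total == 0:
        return [0.0] * len(CODONS)
    probs = [counts[c] / total for c in CODONS]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    if entropy == 0:
        # Degenerate case (a single codon type): the profile is undefined,
        # so this sketch returns zeros.
        return [0.0] * len(CODONS)
    return [-(p * math.log(p)) / entropy if p > 0 else 0.0 for p in probs]


def orf_features(orf, tis_score):
    """Concatenate EDP, TIS score and ORF length (hypothetical layout)."""
    return codon_edp(orf) + [float(tis_score), float(len(orf))]
```

For a non-degenerate ORF the 64 EDP components sum to 1 by construction. The resulting 66-dimensional vectors could be passed to any SVM implementation; how MetaGUN scales or weights these inputs is not stated in the abstract.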

Conclusions: Comparisons on artificial shotgun fragments with multiple current metagenomic gene finders show that MetaGUN gives better predictions at both the 3' and 5' ends of genes for fragments of various lengths. In particular, it makes the most reliable predictions among these methods. As an application, MetaGUN was used to predict genes in two human gut microbiome samples. It identified thousands of additional genes with significant supporting evidence. Further analysis indicates that MetaGUN tends to predict more potential novel genes than other current metagenomic gene finders.

Citing Articles

GeneAI 3.0: powerful, novel, generalized hybrid and ensemble deep learning frameworks for miRNA species classification of stationary patterns from nucleotides.

Singh J, Khanna N, Rout R, Singh N, Laird J, Singh I Sci Rep. 2024; 14(1):7154.

PMID: 38531923 PMC: 11344070. DOI: 10.1038/s41598-024-56786-9.

A toolbox of machine learning software to support microbiome analysis.

Marcos-Zambrano L, Lopez-Molina V, Bakir-Gungor B, Frohme M, Karaduzovic-Hadziabdic K, Klammsteiner T Front Microbiol. 2023; 14:1250806.

PMID: 38075858 PMC: 10704913. DOI: 10.3389/fmicb.2023.1250806.

Genome-centric insight into metabolically active microbial population in shallow-sea hydrothermal vents.

Chen X, Tang K, Zhang M, Liu S, Chen M, Zhan P Microbiome. 2022; 10(1):170.

PMID: 36242065 PMC: 9563475. DOI: 10.1186/s40168-022-01351-7.

Multi-Omics Approaches and Resources for Systems-Level Gene Function Prediction in the Plant Kingdom.

Abdullah-Zawawi M, Govender N, Harun S, Nor Muhammad N, Zainal Z, Mohamed-Hussein Z Plants (Basel). 2022; 11(19).

PMID: 36235479 PMC: 9573505. DOI: 10.3390/plants11192614.

PINC: A Tool for Non-Coding RNA Identification in Plants Based on an Automated Machine Learning Framework.

Zhang X, Zhou X, Wan M, Xuan J, Jin X, Li S Int J Mol Sci. 2022; 23(19).

PMID: 36233123 PMC: 9570155. DOI: 10.3390/ijms231911825.
