Phymm and PhymmBL: Metagenomic Phylogenetic Classification with Interpolated Markov Models

Overview

Journal Nat Methods

Specialties Biomedical Engineering, Pathology

Date 2009 Aug 4

PMID 19648916

Citations 215

Authors

Arthur Brady

Steven L Salzberg

Abstract

Metagenomics projects collect DNA from uncharacterized environments that may contain thousands of species per sample. One main challenge facing metagenomic analysis is phylogenetic classification of raw sequence reads into groups representing the same or similar taxa, a prerequisite for genome assembly and for analyzing the biological diversity of a sample. New sequencing technologies have made metagenomics easier, by making sequencing faster, and more difficult, by producing shorter reads than previous technologies. Classifying sequences from reads as short as 100 base pairs has until now been relatively inaccurate, requiring researchers to use older, long-read technologies. We present Phymm, a classifier for metagenomic data that has been trained on 539 complete, curated genomes and can accurately classify reads as short as 100 base pairs, a substantial improvement over previous composition-based classification methods. We also describe how combining Phymm with sequence alignment algorithms improves accuracy.
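
The idea behind composition-based classification can be illustrated with a small sketch. The Python below is not Phymm's implementation. It swaps the interpolated Markov models named in the title for a single fixed-order Markov model with add-one smoothing, and it trains on two invented toy sequences instead of curated genomes. The function names, the labels (at_rich_taxon, gc_rich_taxon), the order of 3, and the pseudocount are all assumptions made for illustration. A read is assigned to whichever model gives it the highest log-likelihood.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov_model(sequence, order=3):
    """Count how often each base follows each length-`order` context."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(order, len(sequence)):
        context = sequence[i - order:i]
        counts[context][sequence[i]] += 1
    return counts


def log_likelihood(read, counts, order=3, pseudocount=1.0):
    """Log-probability of `read` under one model, with add-one smoothing.

    Assumption: a fixed-order model stands in for an interpolated Markov
    model, which would instead blend predictions from several orders.
    """
    total = 0.0
    for i in range(order, len(read)):
        context_counts = counts.get(read[i - order:i], {})
        numerator = context_counts.get(read[i], 0) + pseudocount
        denominator = sum(context_counts.values()) + pseudocount * len(ALPHABET)
        total += math.log(numerator / denominator)
    return total


def classify(read, models, order=3):
    """Return the label of the model that scores the read highest."""
    return max(models, key=lambda label: log_likelihood(read, models[label], order))


# Hypothetical toy "genomes"; real training data would be complete genomes.
models = {
    "at_rich_taxon": train_markov_model("ATATTAATATAATTTAATATTAAT" * 20),
    "gc_rich_taxon": train_markov_model("GCGGCCGCGCCGGGCGCCGCGGCC" * 20),
}

if __name__ == "__main__":
    for read in ("ATATTAATAT", "GCGGCCGCGC"):
        print(read, classify(read, models))
```

In this sketch every model scores every read, so the cost grows linearly with the number of reference models. Contexts never seen in training fall back to a uniform probability through the pseudocount.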

Citing Articles

Species annotation using a k-mer based KNN model.

Sangar S, Kolage P, Chunarkar-Patil P. Bioinformation. 2025; 20(9):986-989.

PMID: 39917243 PMC: 11795478. DOI: 10.6026/973206300200986.

MNBC: a multithreaded Minimizer-based Naïve Bayes Classifier for improved metagenomic sequence classification.

Lu R, Dumonceaux T, Anzar M, Zovoilis A, Antonation K, Barker D. Bioinformatics. 2024; 40(10).

PMID: 39388213 PMC: 11522871. DOI: 10.1093/bioinformatics/btae601.

MetaCompass: Reference-guided Assembly of Metagenomes.

Luan T, Cepeda V, Liu B, Bowen Z, Ayyangar U, Almeida M. ArXiv. 2024.

PMID: 38903742 PMC: 11188144.

Visualizing metagenomic and metatranscriptomic data: A comprehensive review.

Aplakidou E, Vergoulidis N, Chasapi M, Venetsianou N, Kokoli M, Panagiotopoulou E. Comput Struct Biotechnol J. 2024; 23:2011-2033.

PMID: 38765606 PMC: 11101950. DOI: 10.1016/j.csbj.2024.04.060.

A toolbox of machine learning software to support microbiome analysis.

Marcos-Zambrano L, Lopez-Molina V, Bakir-Gungor B, Frohme M, Karaduzovic-Hadziabdic K, Klammsteiner T. Front Microbiol. 2023; 14:1250806.

PMID: 38075858 PMC: 10704913. DOI: 10.3389/fmicb.2023.1250806.
