Comparative Analysis of Metagenomic Classifiers for Long-read Sequencing Datasets

Overview

Journal BMC Bioinformatics

Publisher BioMed Central

Specialty Biology

Date 2024 Jan 11

PMID 38212694

Authors

Josip Maric

Kresimir Krizanovic

Sylvain Riondet

Niranjan Nagarajan

Mile Sikic

Abstract

Background: Long reads have gained popularity in the analysis of metagenomics data. Therefore, we comprehensively assessed metagenomics classification tools at the species taxonomic level. We analysed kmer-based tools, mapping-based tools and two general-purpose long-read mappers. We evaluated more than 20 pipelines that use either nucleotide or protein databases and selected 13 for an extensive benchmark. We prepared seven synthetic datasets to test various scenarios, including the presence of a host, unknown species and related species. Moreover, we used available sequencing data from three well-defined mock communities, including a dataset with abundances varying from 0.0001% to 20%, and from six real gut microbiomes.

Results: General-purpose mappers Minimap2 and Ram achieved similar or better accuracy on most testing metrics than the best-performing classification tools. They were up to ten times slower than the fastest kmer-based tools requiring up to four times less RAM. All tested tools except CLARK-S were prone to report organisms not present in the datasets, and they underperformed when a high proportion of the host's genetic material was present. Tools that use a protein database performed worse than those based on a nucleotide database. Longer read lengths made classification easier, but because read length distributions differ among species, using only the longest reads reduced accuracy. The comparison of real gut microbiome datasets shows similar abundance profiles for tools of the same type but discordance in the number of reported organisms and in abundances between types. Most assessments showed the influence of database completeness on the reports.

Conclusion: The findings indicate that kmer-based tools are well-suited for rapid analysis of long-read data. However, when heightened accuracy is essential, mappers demonstrate slightly superior performance, albeit at a considerably slower pace. Nevertheless, a combination of diverse categories of tools and databases will likely be necessary to analyse complex samples. Discrepancies observed among tools when applied to real gut datasets, as well as reduced performance in cases where unknown species or a significant proportion of the host genome is present in the sample, highlight the need for continuous improvement of existing tools. Additionally, regular updates and curation of databases are important to ensure their effectiveness.

Citing Articles

The Naïve Bayes classifier++ for metagenomic taxonomic classification-query evaluation.

Duan H, Hearne G, Polikar R, Rosen G. Bioinformatics. 2024; 41(1).

PMID: 39700412 PMC: 11729721. DOI: 10.1093/bioinformatics/btae743.

Filtering out the noise: metagenomic classifiers optimize ancient DNA mapping.

Ravishankar S, Perez V, Davidson R, Roca-Rada X, Lan D, Souilmi Y. Brief Bioinform. 2024; 26(1).

PMID: 39674265 PMC: 11646131. DOI: 10.1093/bib/bbae646.

Oxford Nanopore Technology-Based Identification of an Endosymbiosis in Microbial Keratitis.

Scharf S, Friedrichs L, Bock R, Borrelli M, MacKenzie C, Pfeffer K. Microorganisms. 2024; 12(11).

PMID: 39597681 PMC: 11596929. DOI: 10.3390/microorganisms12112292.

MetaAll: integrative bioinformatics workflow for analysing clinical metagenomic data.

Bosilj M, Suljic A, Zakotnik S, Slunecko J, Kogoj R, Korva M. Brief Bioinform. 2024; 25(6).

PMID: 39550223 PMC: 11568877. DOI: 10.1093/bib/bbae597.

Evaluating metagenomics and targeted approaches for diagnosis and surveillance of viruses.

Buddle S, Forrest L, Akinsuyi N, Martin Bernal L, Brooks T, Venturini C. Genome Med. 2024; 16(1):111.

PMID: 39252069 PMC: 11382446. DOI: 10.1186/s13073-024-01380-x.
