Keeping Up with the Genomes: Efficient Learning of Our Increasing Knowledge of the Tree of Life

Overview
Publisher Biomed Central
Specialty Biology
Date 2020 Sep 22
PMID 32957925
Citations 5
Abstract

Background: Current metagenomic classifiers struggle computationally to keep pace with the training data generated by genome sequencing projects, such as the exponentially growing NCBI RefSeq bacterial genome database. When new reference sequences are added to the training data, a statically trained classifier must be retrained on all of the data, old and new, which is highly inefficient. The rich literature on "incremental learning" addresses the need to update an existing classifier to accommodate new data without sacrificing much accuracy compared to retraining the classifier on all data.

Results: We demonstrate how classification improves over time by incrementally training a classifier on progressive RefSeq snapshots and testing it on: (a) all known current genomes (as a ground truth set) and (b) a real experimental metagenomic gut sample. We demonstrate that as a classifier model's knowledge of genomes grows, classification accuracy increases. The proof-of-concept naïve Bayes implementation, when updated yearly, now runs in 1/4 of the non-incremental time with no accuracy loss.
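To make the idea of incremental training concrete, the following is a minimal sketch of a k-mer count naïve Bayes classifier whose model is updated additively. It is not the authors' implementation: the class name `IncrementalKmerNB`, the toy value `k=3`, the Laplace pseudo-count `alpha`, the uniform class prior, and the example sequences are all assumptions made for illustration. The point it shows is that adding a genome only increments stored counts, so earlier genomes never need to be reprocessed.

```python
import math
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield every overlapping k-mer of seq (upper-cased)."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


class IncrementalKmerNB:
    """Hypothetical multinomial naive Bayes over k-mer counts.

    Assumes a uniform prior over taxa and a DNA alphabet of size 4.
    """

    def __init__(self, k=3, alpha=1.0):
        self.k = k
        self.alpha = alpha                  # Laplace smoothing pseudo-count
        self.counts = defaultdict(Counter)  # taxon -> k-mer -> count
        self.totals = Counter()             # taxon -> total k-mers seen

    def update(self, taxon, genome):
        """Fold one new genome into the model; old genomes are not re-read."""
        new = Counter(kmers(genome, self.k))
        self.counts[taxon].update(new)      # Counter.update adds counts
        self.totals[taxon] += sum(new.values())

    def log_likelihood(self, taxon, read):
        """Smoothed log-likelihood of the read's k-mers under one taxon."""
        vocab = 4 ** self.k
        denom = self.totals[taxon] + self.alpha * vocab
        return sum(
            math.log((self.counts[taxon][km] + self.alpha) / denom)
            for km in kmers(read, self.k)
        )

    def classify(self, read):
        """Return the taxon with the highest log-likelihood for the read."""
        return max(self.totals, key=lambda t: self.log_likelihood(t, read))


# Simulated snapshots: each later "release" adds genomes to the same model.
model = IncrementalKmerNB(k=3)
model.update("taxon_A", "AAAAAAAAAAAAAAAA")   # first snapshot
model.update("taxon_B", "CCCCCCCCCCCCCCCC")   # later snapshot, new taxon
model.update("taxon_A", "AAAATAAAATAAAA")     # later snapshot, known taxon
print(model.classify("CCCCCCCC"))
```

Because the model state is only a table of counts, an update needs the new sequences alone, whereas retraining from scratch would have to read every earlier snapshot again.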

Conclusions: Classification clearly improves when the classifier has the most current knowledge at its disposal. It is therefore of utmost importance to make classifiers computationally tractable so that they can keep up with the data deluge. The incremental learning classifier can be updated efficiently, without the cost of reprocessing or the need to access the existing database, and therefore saves both storage and computational resources.

Citing Articles

The Naïve Bayes classifier++ for metagenomic taxonomic classification-query evaluation.

Duan H, Hearne G, Polikar R, Rosen G. Bioinformatics. 2024; 41(1).

PMID: 39700412 PMC: 11729721. DOI: 10.1093/bioinformatics/btae743.


MNBC: a multithreaded Minimizer-based Naïve Bayes Classifier for improved metagenomic sequence classification.

Lu R, Dumonceaux T, Anzar M, Zovoilis A, Antonation K, Barker D. Bioinformatics. 2024; 40(10).

PMID: 39388213 PMC: 11522871. DOI: 10.1093/bioinformatics/btae601.


YACHT: an ANI-based statistical test to detect microbial presence/absence in a metagenomic sample.

Koslicki D, White S, Ma C, Novikov A. Bioinformatics. 2024; 40(2).

PMID: 38268451 PMC: 10868342. DOI: 10.1093/bioinformatics/btae047.


Improving taxonomic classification with feature space balancing.

Fuhl W, Zabel S, Nieselt K. Bioinform Adv. 2023; 3(1):vbad092.

PMID: 37577265 PMC: 10415173. DOI: 10.1093/bioadv/vbad092.


Complet+: a computationally scalable method to improve completeness of large-scale protein sequence clustering.

Nguyen R, Sokhansanj B, Polikar R, Rosen G. PeerJ. 2023; 11:e14779.

PMID: 36785708 PMC: 9921987. DOI: 10.7717/peerj.14779.

