
Estimating the Total Genome Length of a Metagenomic Sample Using K-mers

Overview
Journal BMC Genomics
Publisher BioMed Central
Specialty Genetics
Date 2019 Apr 11
PMID 30967110
Citations 1
Abstract

Background: Metagenomic sequencing is a powerful technology for studying mixtures of microbes, or microbiomes, on humans and in the environment. A basic task in analyzing metagenomic data is to identify the component genomes in the community. This task is challenging because of the complexity of microbiome composition, the limited availability of known reference genomes, and sequencing coverage that is usually insufficient.

Results: As an initial step toward understanding the complete composition of a metagenomic sample, we studied the problem of estimating the total length of all distinct component genomes in the sample. We showed that this problem can be solved by estimating the total number of distinct k-mers across all the metagenomic sequencing data. We proposed a method for this estimation based on the sequencing coverage distribution of observed k-mers, and introduced a k-mer redundancy index (KRI) to bridge the gap between the count of distinct k-mers and the total genome length. We showed the effectiveness of the proposed method on a set of carefully designed simulated datasets that correspond to multiple scenarios found in real metagenomic data. Results on real data indicate that the amount of uncaptured genomic information can vary dramatically across metagenomic samples, with the potential to mislead downstream analyses.
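The idea of working from the coverage distribution of observed k-mers can be illustrated with a minimal sketch. It is not the authors' estimator, and it does not compute the KRI, which the abstract does not define. The code counts k-mers in a set of reads, builds the coverage histogram (how many distinct k-mers were seen exactly c times), and then applies the generic Chao1 richness estimator as a stand-in for the paper's method. The function names `count_kmers`, `coverage_histogram`, and `chao1_distinct_estimate` are hypothetical, and only the forward strand is considered.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def coverage_histogram(kmer_counts):
    """Map coverage c to the number of distinct k-mers seen exactly c times."""
    return Counter(kmer_counts.values())


def chao1_distinct_estimate(hist):
    """Estimate the total number of distinct k-mers, including unseen ones.

    Chao1 is a generic richness estimator used here as an assumption;
    it is NOT the estimator proposed in the paper.
    """
    observed = sum(hist.values())
    f1 = hist.get(1, 0)  # k-mers seen exactly once
    f2 = hist.get(2, 0)  # k-mers seen exactly twice
    if f2 == 0:
        # Bias-corrected form, which avoids division by zero.
        return observed + f1 * (f1 - 1) / 2
    return observed + f1 * f1 / (2 * f2)


if __name__ == "__main__":
    reads = ["ACGTAC", "CGTACG"]  # toy reads, not real data
    hist = coverage_histogram(count_kmers(reads, k=3))
    print(dict(hist), chao1_distinct_estimate(hist))
```

The gap between distinct k-mers and genome length is also easy to see on a toy sequence: the repetitive string `ACGACGACG` has 7 k-mer positions for k = 3 but only 3 distinct 3-mers, so a distinct k-mer count alone would underestimate its length.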

Conclusions: We posed the question of what the total genome length of all distinct species in a microbial community is, and introduced a method to answer it.

Citing Articles

Enhancing Clinical Utility: Utilization of International Standards and Guidelines for Metagenomic Sequencing in Infectious Disease Diagnosis.

Kan C, Tsang H, Pei X, Ng S, Yim A, Yu A. Int J Mol Sci. 2024; 25(6).

PMID: 38542307 PMC: 10970082. DOI: 10.3390/ijms25063333.
