Estimating the Total Genome Length of a Metagenomic Sample Using K-mers

Overview

Journal BMC Genomics

Publisher BioMed Central

Specialty Genetics

Date 2019 Apr 11

PMID 30967110

Citations 1

Authors

Kui Hua

Xuegong Zhang


Abstract

Background: Metagenomic sequencing is a powerful technology for studying mixtures of microbes, or microbiomes, on humans and in the environment. One basic task in analyzing metagenomic data is to identify the component genomes in the community. This task is challenging because of the complexity of microbiome composition, the limited availability of known reference genomes, and sequencing coverage that is usually insufficient.

Results: As an initial step toward understanding the complete composition of a metagenomic sample, we studied the problem of estimating the total length of all distinct component genomes in a metagenomic sample. We showed that this problem can be solved by estimating the total number of distinct k-mers in all the metagenomic sequencing data. We proposed a method for this estimation based on the sequencing coverage distribution of observed k-mers, and introduced a k-mer redundancy index (KRI) to fill in the gap between the count of distinct k-mers and the total genome length. We showed the effectiveness of the proposed method on a set of carefully designed simulation data corresponding to multiple situations of true metagenomic data. Results on real data indicate that the uncaptured genomic information can vary dramatically across metagenomic samples, with the potential to mislead downstream analyses.
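The link between distinct k-mers and genome length can be illustrated with a small example. The following is a minimal sketch, not the authors' estimator: it counts k-mers on the forward strand only, ignores sequencing errors and reverse complements, and uses a made-up toy genome and hypothetical helper names (`count_kmers`, `coverage_histogram`). It shows the two quantities the abstract refers to: the number of distinct observed k-mers and the distribution of their sequencing coverage.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer occurring in the reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def coverage_histogram(kmer_counts):
    """Map each coverage value c to the number of distinct k-mers seen exactly c times."""
    return Counter(kmer_counts.values())

# Toy linear "genome" (an assumption for illustration) with no repeated 4-mers,
# covered by three overlapping error-free reads.
genome = "ACGTTGCAAGGCTA"
k = 4
reads = [genome[0:8], genome[3:11], genome[6:14]]

kmer_counts = count_kmers(reads, k)
hist = coverage_histogram(kmer_counts)
n_distinct = len(kmer_counts)

print("distinct k-mers:", n_distinct)
print("genome length - k + 1:", len(genome) - k + 1)
print("coverage histogram:", dict(sorted(hist.items())))
```

When every k-mer in a linear genome of length G is unique and fully sequenced, the number of distinct k-mers is G - k + 1, which is why the distinct k-mer count can stand in for genome length. Repeated k-mers within or across genomes break this equality; that is the gap the abstract says the KRI is meant to bridge. With incomplete coverage some k-mers are never observed, so the true count has to be estimated from the coverage histogram rather than read off directly.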

Conclusions: We posed the question of what the total genome length of all distinct species in a microbial community is, and introduced a method to answer it.

Citing Articles

Enhancing Clinical Utility: Utilization of International Standards and Guidelines for Metagenomic Sequencing in Infectious Disease Diagnosis.

Kan C, Tsang H, Pei X, Ng S, Yim A, Yu A Int J Mol Sci. 2024; 25(6).

PMID: 38542307 PMC: 10970082. DOI: 10.3390/ijms25063333.
