These Are Not the K-mers You Are Looking For: Efficient Online K-mer Counting Using a Probabilistic Data Structure

Overview

Journal: PLoS One

Specialties: General Medicine, Science

Date: 2014 Jul 26

PMID: 25062443

Citations: 35

Authors

Qingpeng Zhang

Jason Pell

Rosangela Canino-Koning

Adina Chuang Howe

C Titus Brown


Abstract

K-mer abundance analysis is widely used for many purposes in nucleotide sequence analysis, including data preprocessing for de novo assembly, repeat detection, and sequencing coverage estimation. We present the khmer software package for fast and memory-efficient online counting of k-mers in sequencing data sets. Unlike previous methods based on data structures such as hash tables, suffix arrays, and trie structures, khmer relies entirely on a simple probabilistic data structure, a Count-Min Sketch. The Count-Min Sketch permits online updating and retrieval of k-mer counts in memory, which is necessary to support online k-mer analysis algorithms. On sparse data sets this data structure is considerably more memory-efficient than any exact data structure. In exchange, the use of a Count-Min Sketch introduces a systematic overcount for k-mers; moreover, only the counts, and not the k-mers, are stored. Here we analyze the speed, the memory usage, and the miscount rate of khmer for generating k-mer frequency distributions and retrieving k-mer counts for individual k-mers. We also compare the performance of khmer to several other k-mer counting packages, including Tallymer, Jellyfish, BFCounter, DSK, KMC, Turtle, and KAnalyze. Finally, we examine the effectiveness of profiling sequencing error, k-mer abundance trimming, and digital normalization of reads in the context of high khmer false positive rates. khmer is implemented in C++ wrapped in a Python interface, offers a tested and robust API, and is freely available under the BSD license at github.com/ged-lab/khmer.
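As an illustration of the data structure the abstract describes, the following is a minimal Count-Min Sketch for k-mer counting in pure Python. It is a sketch under stated assumptions, not khmer's implementation or API: the class `CountMinSketch`, the helper `consume`, the SHA-256 hash, and the table sizes are all hypothetical choices made for this example, and reverse-complement handling is omitted. It shows the two properties noted above: only counts are stored (never the k-mers themselves), and a retrieved count can exceed the true count but never fall below it.

```python
import hashlib


class CountMinSketch:
    """Illustrative Count-Min Sketch; names and hashing are assumptions."""

    def __init__(self, table_sizes):
        # One counter array per table. Using different sizes makes it
        # unlikely that two k-mers collide in every table at once.
        self.tables = [[0] * size for size in table_sizes]

    def _slots(self, kmer):
        # Hash the k-mer once, then reduce modulo each table size.
        digest = hashlib.sha256(kmer.encode("ascii")).digest()
        value = int.from_bytes(digest[:8], "big")
        return [value % len(table) for table in self.tables]

    def add(self, kmer):
        # Online update: increment one counter in every table.
        for table, slot in zip(self.tables, self._slots(kmer)):
            table[slot] += 1

    def get(self, kmer):
        # The minimum over tables is the least-inflated estimate.
        # Collisions only add to counters, so this never undercounts.
        return min(table[slot]
                   for table, slot in zip(self.tables, self._slots(kmer)))


def consume(sketch, sequence, k):
    """Add every k-mer of `sequence` to `sketch`; return how many were added."""
    n_kmers = max(len(sequence) - k + 1, 0)
    for i in range(n_kmers):
        sketch.add(sequence[i:i + k])
    return n_kmers


if __name__ == "__main__":
    sketch = CountMinSketch([97, 101, 103])  # small illustrative sizes
    consume(sketch, "ACGTACGTAC", 4)
    # "ACGT" occurs twice, so the estimate is at least 2.
    print(sketch.get("ACGT"))
```

Shrinking the tables trades memory for accuracy: with a single one-slot table, every k-mer shares the same counter and each query returns the total number of k-mers consumed, which is the extreme case of the systematic overcount described in the abstract.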

Citing Articles

An alignment-free method for phylogeny estimation using maximum likelihood.

Zahin T, Abrar M, Jewel M, Tasnim T, Bayzid M, Rahman A. BMC Bioinformatics. 2025; 26(1):77.

PMID: 40055594 PMC: 11887328. DOI: 10.1186/s12859-025-06080-w.

Hookworm genes encoding intestinal excreted-secreted proteins are transcriptionally upregulated in response to the host's immune system.

Schwarz E, Noon J, Chicca J, Garceau C, Li H, Antoshechkin I. bioRxiv. 2025.

PMID: 39975173 PMC: 11838427. DOI: 10.1101/2025.02.01.636063.

Efficient Storage and Analysis of Genomic Data: A k-mer Frequency Mapping and Image Representation Method.

Luleci H, Ari Yuka S, Yilmaz A. Interdiscip Sci. 2024.

PMID: 39432054 DOI: 10.1007/s12539-024-00659-2.

MerCat2: a versatile k-mer counter and diversity estimator for database-independent property analysis obtained from omics data.

Figueroa 3rd J, Redinbo A, Panyala A, Colby S, Friesen M, Tiemann L. Bioinform Adv. 2024; 4(1):vbae061.

PMID: 38745763 PMC: 11090762. DOI: 10.1093/bioadv/vbae061.

A CNN based m5c RNA methylation predictor.

Aslam I, Shah S, Jabeen S, ElAffendi M, Abdel Latif A, Ul Haq N. Sci Rep. 2023; 13(1):21885.

PMID: 38081880 PMC: 10713599. DOI: 10.1038/s41598-023-48751-9.

References

Metzker M. Sequencing technologies - the next generation. Nat Rev Genet. 2009; 11(1):31-46. DOI: 10.1038/nrg2626.

Melsted P, Pritchard J. Efficient counting of k-mers in DNA sequences using a bloom filter. BMC Bioinformatics. 2011; 12:333. PMC: 3166945. DOI: 10.1186/1471-2105-12-333.

Roy R, Bhattacharya D, Schliep A. Turtle: identifying frequent k-mers with cache-efficient algorithms. Bioinformatics. 2014; 30(14):1950-7. DOI: 10.1093/bioinformatics/btu132.

Pell J, Hintze A, Canino-Koning R, Howe A, Tiedje J, Brown C. Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. Proc Natl Acad Sci U S A. 2012; 109(33):13272-7. PMC: 3421212. DOI: 10.1073/pnas.1121464109.

Audano P, Vannberg F. KAnalyze: a fast versatile pipelined k-mer toolkit. Bioinformatics. 2014; 30(14):2070-2. PMC: 4080738. DOI: 10.1093/bioinformatics/btu152.

Jones D, Ruzzo W, Peng X, Katze M. Compression of next-generation sequencing reads aided by highly efficient de novo assembly. Nucleic Acids Res. 2012; 40(22):e171. PMC: 3526293. DOI: 10.1093/nar/gks754.

Rizk G, Lavenier D, Chikhi R. DSK: k-mer counting with very low memory usage. Bioinformatics. 2013; 29(5):652-3. DOI: 10.1093/bioinformatics/btt020.

Crusoe M, Alameldin H, Awad S, Boucher E, Caldwell A, Cartwright R. The khmer software package: enabling efficient nucleotide sequence analysis. F1000Res. 2015; 4:900. PMC: 4608353. DOI: 10.12688/f1000research.6924.1.

Minoche A, Dohm J, Himmelbauer H. Evaluation of genomic high-throughput sequencing data generated on Illumina HiSeq and genome analyzer systems. Genome Biol. 2011; 12(11):R112. PMC: 3334598. DOI: 10.1186/gb-2011-12-11-r112.

Deorowicz S, Debudaj-Grabysz A, Grabowski S. Disk-based k-mer counting on a PC. BMC Bioinformatics. 2013; 14:160. PMC: 3680041. DOI: 10.1186/1471-2105-14-160.

Kurtz S, Narechania A, Stein J, Ware D. A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes. BMC Genomics. 2008; 9:517. PMC: 2613927. DOI: 10.1186/1471-2164-9-517.

Haas B, Papanicolaou A, Yassour M, Grabherr M, Blood P, Bowden J. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat Protoc. 2013; 8(8):1494-512. PMC: 3875132. DOI: 10.1038/nprot.2013.084.

Li X, Waterman M. Estimating the repeat structure and length of DNA sequences using L-tuples. Genome Res. 2003; 13(8):1916-22. PMC: 403783. DOI: 10.1101/gr.1251803.

Chitsaz H, Yee-Greenbaum J, Tesler G, Lombardo M, Dupont C, Badger J. Efficient de novo assembly of single-cell bacterial genomes from short-read data sets. Nat Biotechnol. 2011; 29(10):915-21. PMC: 3558281. DOI: 10.1038/nbt.1966.

Qin J, Li R, Raes J, Arumugam M, Burgdorf K, Manichanh C. A human gut microbial gene catalogue established by metagenomic sequencing. Nature. 2010; 464(7285):59-65. PMC: 3779803. DOI: 10.1038/nature08821.

Howe A, Jansson J, Malfatti S, Tringe S, Tiedje J, Brown C. Tackling soil diversity with the assembly of large, complex metagenomes. Proc Natl Acad Sci U S A. 2014; 111(13):4904-9. PMC: 3977251. DOI: 10.1073/pnas.1402564111.

Luo W, Friedman M, Shedden K, Hankenson K, Woolf P. GAGE: generally applicable gene set enrichment for pathway analysis. BMC Bioinformatics. 2009; 10:161. PMC: 2696452. DOI: 10.1186/1471-2105-10-161.

Zerbino D, Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008; 18(5):821-9. PMC: 2336801. DOI: 10.1101/gr.074492.107.

Chikhi R, Medvedev P. Informed and automated k-mer size selection for genome assembly. Bioinformatics. 2013; 30(1):31-7. DOI: 10.1093/bioinformatics/btt310.

Pevzner P, Tang H, Waterman M. An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci U S A. 2001; 98(17):9748-53. PMC: 55524. DOI: 10.1073/pnas.171285098.