» Articles » PMID: 25062443

These Are Not the K-mers You Are Looking For: Efficient Online K-mer Counting Using a Probabilistic Data Structure

Overview
Journal PLoS One
Date 2014 Jul 26
PMID 25062443
Citations 35
Authors
Affiliations
Soon will be listed here.
Abstract

K-mer abundance analysis is widely used for many purposes in nucleotide sequence analysis, including data preprocessing for de novo assembly, repeat detection, and sequencing coverage estimation. We present the khmer software package for fast and memory efficient online counting of k-mers in sequencing data sets. Unlike previous methods based on data structures such as hash tables, suffix arrays, and trie structures, khmer relies entirely on a simple probabilistic data structure, a Count-Min Sketch. The Count-Min Sketch permits online updating and retrieval of k-mer counts in memory which is necessary to support online k-mer analysis algorithms. On sparse data sets this data structure is considerably more memory efficient than any exact data structure. In exchange, the use of a Count-Min Sketch introduces a systematic overcount for k-mers; moreover, only the counts, and not the k-mers, are stored. Here we analyze the speed, the memory usage, and the miscount rate of khmer for generating k-mer frequency distributions and retrieving k-mer counts for individual k-mers. We also compare the performance of khmer to several other k-mer counting packages, including Tallymer, Jellyfish, BFCounter, DSK, KMC, Turtle and KAnalyze. Finally, we examine the effectiveness of profiling sequencing error, k-mer abundance trimming, and digital normalization of reads in the context of high khmer false positive rates. khmer is implemented in C++ wrapped in a Python interface, offers a tested and robust API, and is freely available under the BSD license at github.com/ged-lab/khmer.

Citing Articles

An alignment-free method for phylogeny estimation using maximum likelihood.

Zahin T, Abrar M, Jewel M, Tasnim T, Bayzid M, Rahman A BMC Bioinformatics. 2025; 26(1):77.

PMID: 40055594 PMC: 11887328. DOI: 10.1186/s12859-025-06080-w.


Hookworm genes encoding intestinal excreted-secreted proteins are transcriptionally upregulated in response to the host's immune system.

Schwarz E, Noon J, Chicca J, Garceau C, Li H, Antoshechkin I bioRxiv. 2025; .

PMID: 39975173 PMC: 11838427. DOI: 10.1101/2025.02.01.636063.


Efficient Storage and Analysis of Genomic Data: A k-mer Frequency Mapping and Image Representation Method.

Luleci H, Ari Yuka S, Yilmaz A Interdiscip Sci. 2024; .

PMID: 39432054 DOI: 10.1007/s12539-024-00659-2.


MerCat2: a versatile -mer counter and diversity estimator for database-independent property analysis obtained from omics data.

Figueroa 3rd J, Redinbo A, Panyala A, Colby S, Friesen M, Tiemann L Bioinform Adv. 2024; 4(1):vbae061.

PMID: 38745763 PMC: 11090762. DOI: 10.1093/bioadv/vbae061.


A CNN based m5c RNA methylation predictor.

Aslam I, Shah S, Jabeen S, ElAffendi M, Abdel Latif A, Ul Haq N Sci Rep. 2023; 13(1):21885.

PMID: 38081880 PMC: 10713599. DOI: 10.1038/s41598-023-48751-9.


References
1.
Metzker M . Sequencing technologies - the next generation. Nat Rev Genet. 2009; 11(1):31-46. DOI: 10.1038/nrg2626. View

2.
Melsted P, Pritchard J . Efficient counting of k-mers in DNA sequences using a bloom filter. BMC Bioinformatics. 2011; 12:333. PMC: 3166945. DOI: 10.1186/1471-2105-12-333. View

3.
Roy R, Bhattacharya D, Schliep A . Turtle: identifying frequent k-mers with cache-efficient algorithms. Bioinformatics. 2014; 30(14):1950-7. DOI: 10.1093/bioinformatics/btu132. View

4.
Pell J, Hintze A, Canino-Koning R, Howe A, Tiedje J, Brown C . Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. Proc Natl Acad Sci U S A. 2012; 109(33):13272-7. PMC: 3421212. DOI: 10.1073/pnas.1121464109. View

5.
Audano P, Vannberg F . KAnalyze: a fast versatile pipelined k-mer toolkit. Bioinformatics. 2014; 30(14):2070-2. PMC: 4080738. DOI: 10.1093/bioinformatics/btu152. View