
A Benchmark Study of K-mer Counting Methods for High-throughput Sequencing

Overview
Journal Gigascience
Specialties Biology, Genetics
Date 2018 Oct 23
PMID 30346548
Citations 34
Abstract

The rapid development of high-throughput sequencing technologies means that hundreds of gigabytes of sequencing data can be produced in a single study. Many bioinformatics tools require counts of substrings of length k in DNA/RNA sequencing reads obtained for applications such as genome and transcriptome assembly, error correction, multiple sequence alignment, and repeat detection. Recently, several techniques have been developed to count k-mers in large sequencing datasets, with a trade-off between the time and memory required to perform this function. We assessed several k-mer counting programs and evaluated their relative performance, primarily on the basis of runtime and memory usage. We also considered additional parameters such as disk usage, accuracy, parallelism, the impact of compressed input, performance in terms of counting large k values and the scalability of the application to larger datasets. We make specific recommendations for the setup of a current state-of-the-art program and suggestions for further development.
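
To make concrete the task the abstract describes, counting substrings of length k in sequencing reads, here is a minimal Python sketch. It is a naive in-memory hash-table baseline for illustration only, not the method of any program assessed in the study. The function name `count_kmers`, the sample reads, and the choice to skip k-mers containing the ambiguous base N are all assumptions.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring across an iterable of read strings.

    Hypothetical helper for illustration: all counts are held in memory,
    which is exactly the cost that dedicated k-mer counters try to reduce.
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        # A read shorter than k yields an empty range and contributes nothing.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Assumption: k-mers containing the ambiguous base N are skipped.
            if "N" not in kmer:
                counts[kmer] += 1
    return counts


# Made-up example reads, not data from the study.
reads = ["ACGTAC", "CGTACG"]
counts = count_kmers(reads, 3)
```

Because the table holds one entry per distinct k-mer, its memory use grows with the number of distinct k-mers in the dataset. That growth is the source of the time and memory trade-off the abstract mentions.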

Citing Articles

The genomes of the most diverse AA genome rice species provide a resource for rice improvement and studies of rice evolution and domestication.

Abdullah M, Furtado A, Masouleh A, Okemo P, Henry R BMC Genomics. 2025; 26(1):54.

PMID: 39838314 PMC: 11748844. DOI: 10.1186/s12864-025-11246-0.


Efficient Storage and Analysis of Genomic Data: A k-mer Frequency Mapping and Image Representation Method.

Luleci H, Ari Yuka S, Yilmaz A Interdiscip Sci. 2024; .

PMID: 39432054 DOI: 10.1007/s12539-024-00659-2.


The genomes of Australian wild limes.

Nakandala U, Furtado A, Masouleh A, Smith M, Mason P, Williams D Plant Mol Biol. 2024; 114(5):102.

PMID: 39316221 PMC: 11422456. DOI: 10.1007/s11103-024-01502-4.


A survey of k-mer methods and applications in bioinformatics.

Moeckel C, Mareboina M, Konnaris M, Chan C, Mouratidis I, Montgomery A Comput Struct Biotechnol J. 2024; 23:2289-2303.

PMID: 38840832 PMC: 11152613. DOI: 10.1016/j.csbj.2024.05.025.


The genome of Citrus australasica reveals disease resistance and other species specific genes.

Nakandala U, Furtado A, Masouleh A, Smith M, Williams D, Henry R BMC Plant Biol. 2024; 24(1):260.

PMID: 38594608 PMC: 11005238. DOI: 10.1186/s12870-024-04988-8.

