A Benchmark Study of K-mer Counting Methods for High-throughput Sequencing

Overview

Journal: GigaScience

Publisher: Oxford University Press

Specialties: Biology, Genetics

Date: 2018 Oct 23

PMID: 30346548

Citations: 34

Authors

Swati C Manekar

Shailesh R Sathe

Abstract

The rapid development of high-throughput sequencing technologies means that hundreds of gigabytes of sequencing data can be produced in a single study. Many bioinformatics tools require counts of substrings of length k in DNA/RNA sequencing reads obtained for applications such as genome and transcriptome assembly, error correction, multiple sequence alignment, and repeat detection. Recently, several techniques have been developed to count k-mers in large sequencing datasets, with a trade-off between the time and memory required to perform this function. We assessed several k-mer counting programs and evaluated their relative performance, primarily on the basis of runtime and memory usage. We also considered additional parameters such as disk usage, accuracy, parallelism, the impact of compressed input, performance in terms of counting large k values and the scalability of the application to larger datasets. We make specific recommendations for the setup of a current state-of-the-art program and suggestions for further development.
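To make the counting task concrete, the following is a minimal in-memory sketch of k-mer counting in Python. It is an illustration only, not the method of any program assessed in the study: the function name `count_kmers`, the example reads, and the choice to skip k-mers containing the ambiguous base `N` are all assumptions made here. A plain dictionary-based counter like this holds every distinct k-mer in memory, which is the kind of cost that the time and memory trade-off mentioned in the abstract refers to.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring (k-mer) across an iterable of reads.

    Hypothetical helper for illustration; k-mers containing 'N' are
    skipped, which is an assumption rather than a rule from the study.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    counts = Counter()
    for read in reads:
        seq = read.upper()
        # Slide a window of width k over the read, one base at a time.
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts


# Two made-up reads, used only to show the call.
reads = ["ACGTAC", "CGTACG"]
counts = count_kmers(reads, 3)
for kmer, n in sorted(counts.items()):
    print(kmer, n)
```

A read of length L contributes at most L - k + 1 k-mers, so the work grows linearly with the total number of bases, while the memory grows with the number of distinct k-mers.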

Citing Articles

The genomes of the most diverse AA genome rice species provide a resource for rice improvement and studies of rice evolution and domestication.

Abdullah M, Furtado A, Masouleh A, Okemo P, Henry R. BMC Genomics. 2025; 26(1):54.

PMID: 39838314 PMC: 11748844. DOI: 10.1186/s12864-025-11246-0.

Efficient Storage and Analysis of Genomic Data: A k-mer Frequency Mapping and Image Representation Method.

Luleci H, Ari Yuka S, Yilmaz A. Interdiscip Sci. 2024; .

PMID: 39432054 DOI: 10.1007/s12539-024-00659-2.

The genomes of Australian wild limes.

Nakandala U, Furtado A, Masouleh A, Smith M, Mason P, Williams D. Plant Mol Biol. 2024; 114(5):102.

PMID: 39316221 PMC: 11422456. DOI: 10.1007/s11103-024-01502-4.

A survey of k-mer methods and applications in bioinformatics.

Moeckel C, Mareboina M, Konnaris M, Chan C, Mouratidis I, Montgomery A. Comput Struct Biotechnol J. 2024; 23:2289-2303.

PMID: 38840832 PMC: 11152613. DOI: 10.1016/j.csbj.2024.05.025.

The genome of Citrus australasica reveals disease resistance and other species specific genes.

Nakandala U, Furtado A, Masouleh A, Smith M, Williams D, Henry R. BMC Plant Biol. 2024; 24(1):260.

PMID: 38594608 PMC: 11005238. DOI: 10.1186/s12870-024-04988-8.

References

1. Kelley D, Schatz M, Salzberg S. Quake: quality-aware detection and correction of sequencing errors. Genome Biol. 2010; 11(11):R116. PMC: 3156955. DOI: 10.1186/gb-2010-11-11-r116.

2. Li R, Ye J, Li S, Wang J, Han Y, Ye C. ReAS: Recovery of ancestral sequences for transposable elements from the unassembled reads of a whole genome shotgun. PLoS Comput Biol. 2005; 1(4):e43. PMC: 1232128. DOI: 10.1371/journal.pcbi.0010043.

3. Reuter J, Spacek D, Snyder M. High-throughput sequencing technologies. Mol Cell. 2015; 58(4):586-97. PMC: 4494749. DOI: 10.1016/j.molcel.2015.05.004.

4. Miller J, Delcher A, Koren S, Venter E, Walenz B, Brownley A. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics. 2008; 24(24):2818-24. PMC: 2639302. DOI: 10.1093/bioinformatics/btn548.

5. Mapleson D, Accinelli G, Kettleborough G, Wright J, Clavijo B. KAT: a K-mer analysis toolkit to quality control NGS datasets and genome assemblies. Bioinformatics. 2016; 33(4):574-576. PMC: 5408915. DOI: 10.1093/bioinformatics/btw663.

6. Campagna D, Romualdi C, Vitulo N, Del Favero M, Lexa M, Cannata N. RAP: a new computer program for de novo identification of repeated sequences in whole genomes. Bioinformatics. 2004; 21(5):582-8. DOI: 10.1093/bioinformatics/bti039.

7. Erbert M, Rechner S, Muller-Hannemann M. Gerbil: a fast and memory-efficient k-mer counter with GPU-support. Algorithms Mol Biol. 2017; 12:9. PMC: 5374613. DOI: 10.1186/s13015-017-0097-9.

8. Roberts R, Carneiro M, Schatz M. The advantages of SMRT sequencing. Genome Biol. 2013; 14(7):405. PMC: 3953343. DOI: 10.1186/gb-2013-14-6-405.

9. Roy R, Bhattacharya D, Schliep A. Turtle: identifying frequent k-mers with cache-efficient algorithms. Bioinformatics. 2014; 30(14):1950-7. DOI: 10.1093/bioinformatics/btu132.

10. Mamun A, Pal S, Rajasekaran S. KCMBT: a k-mer Counter based on Multiple Burst Trees. Bioinformatics. 2016; 32(18):2783-90. PMC: 5939891. DOI: 10.1093/bioinformatics/btw345.

11. Zerbino D, Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008; 18(5):821-9. PMC: 2336801. DOI: 10.1101/gr.074492.107.

12. Sindi S, Hunt B, Yorke J. Duplication count distributions in DNA sequences. Phys Rev E Stat Nonlin Soft Matter Phys. 2009; 78(6 Pt 1):061912. PMC: 3121164. DOI: 10.1103/PhysRevE.78.061912.

13. Pajuste F, Kaplinski L, Mols M, Puurand T, Lepamets M, Remm M. FastGT: an alignment-free method for calling common SNVs directly from raw sequencing reads. Sci Rep. 2017; 7(1):2537. PMC: 5451431. DOI: 10.1038/s41598-017-02487-5.

14. Marcais G, Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 2011; 27(6):764-70. PMC: 3051319. DOI: 10.1093/bioinformatics/btr011.

15. Melsted P, Pritchard J. Efficient counting of k-mers in DNA sequences using a bloom filter. BMC Bioinformatics. 2011; 12:333. PMC: 3166945. DOI: 10.1186/1471-2105-12-333.

16. Sameith K, Roscito J, Hiller M. Iterative error correction of long sequencing reads maximizes accuracy and improves contig assembly. Brief Bioinform. 2016; 18(1):1-8. PMC: 5221426. DOI: 10.1093/bib/bbw003.

17. Edgar R. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004; 32(5):1792-7. PMC: 390337. DOI: 10.1093/nar/gkh340.

18. Myers E, Sutton G, Delcher A, Dew I, Fasulo D, Flanigan M. A whole-genome assembly of Drosophila. Science. 2000; 287(5461):2196-204. DOI: 10.1126/science.287.5461.2196.

19. Audano P, Vannberg F. KAnalyze: a fast versatile pipelined k-mer toolkit. Bioinformatics. 2014; 30(14):2070-2. PMC: 4080738. DOI: 10.1093/bioinformatics/btu152.

20. Simpson J, Wong K, Jackman S, Schein J, Jones S, Birol I. ABySS: a parallel assembler for short read sequence data. Genome Res. 2009; 19(6):1117-23. PMC: 2694472. DOI: 10.1101/gr.089532.108.