Gerbil: a Fast and Memory-efficient k-mer Counter with GPU-support
Background: A basic task in bioinformatics is the counting of k-mers in genome sequences. Existing k-mer counting tools are most often optimized for small k < 32 and suffer from excessive memory consumption or degrading performance for large k. However, given the technology trend towards long reads of next-generation sequencers, support for large k becomes increasingly important.
Results: We present the open source k-mer counting software Gerbil, which has been designed for the efficient counting of k-mers for k ≥ 32. Our software is the result of an intensive process of algorithm engineering. It implements a two-step approach. In the first step, genome reads are loaded from disk and redistributed to temporary files. In the second step, the k-mers of each temporary file are counted via a hash table approach. In addition to its basic functionality, Gerbil can optionally use GPUs to accelerate the counting step. In a set of experiments with real-world genome data sets, we show that Gerbil is able to efficiently support both small and large k.
Conclusions: While Gerbil's performance is comparable to existing state-of-the-art open source k-mer counting tools for small k < 32, it vastly outperforms its competitors for large k, thereby enabling new applications which require large values of k.
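The two-step approach described in the Results can be illustrated with a minimal Python sketch. It is not Gerbil's implementation: the abstract does not say how reads are assigned to temporary files, so the CRC32-based bucketing below is an assumption, as are the function names `distribute`, `count_bucket`, and `count_kmers` and the parameter `num_buckets`. The sketch also ignores reverse complements and GPU acceleration.

```python
import os
import tempfile
import zlib
from collections import Counter


def distribute(reads, k, num_buckets, tmp_dir):
    """Step 1: write every k-mer of every read to one of num_buckets temporary files."""
    paths = [os.path.join(tmp_dir, f"bucket_{i}.txt") for i in range(num_buckets)]
    files = [open(p, "w") for p in paths]
    try:
        for read in reads:
            for i in range(len(read) - k + 1):
                kmer = read[i:i + k]
                # Assumption: a stable hash (CRC32) picks the file, so equal
                # k-mers always land in the same bucket.
                bucket = zlib.crc32(kmer.encode()) % num_buckets
                files[bucket].write(kmer + "\n")
    finally:
        for f in files:
            f.close()
    return paths


def count_bucket(path):
    """Step 2: count the k-mers of one temporary file with a hash table."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            counts[line.rstrip("\n")] += 1
    return counts


def count_kmers(reads, k, num_buckets=4):
    """Run both steps and merge the per-bucket counts."""
    total = Counter()
    with tempfile.TemporaryDirectory() as tmp_dir:
        for path in distribute(reads, k, num_buckets, tmp_dir):
            # Buckets are disjoint, so each one can be counted independently.
            total.update(count_bucket(path))
    return total
```

Because each k-mer value maps to exactly one file, only one bucket's hash table needs to be in memory at a time, which is the general motivation for splitting the work into a distribution step and a counting step.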