A Simple Refined DNA Minimizer Operator Enables 2-fold Faster Computation

Overview

Journal Bioinformatics

Publisher Oxford University Press

Specialty Biology

Date 2024 Jan 25

PMID 38269626

Authors

Chenxu Pan

Knut Reinert

Abstract

Motivation: The minimizer concept is a data structure for sequence sketching. The standard canonical minimizer selects a subset of k-mers from a given DNA sequence by simultaneously comparing the forward and reverse k-mers in a window according to a predefined selection scheme. It is widely used in sequence analysis tasks such as read mapping and assembly. k-mer density, k-mer repetitiveness (e.g. k-mer bias), and computational efficiency are three critical measures for minimizer selection schemes. However, the different minimizer variants trade these measures off against one another. High-performance minimizer algorithms are always required to be generic, effective, and efficient.

Results: We propose a simple minimizer operator as a refinement of the standard canonical minimizer. It takes only a few operations to compute, yet it can improve k-mer repetitiveness, especially for the lexicographic order. It also applies to other selection schemes based on total orders (e.g. random orders). Moreover, it is computationally efficient, and its density is close to that of the standard minimizer. The refined minimizer may benefit high-performance applications such as binning and read mapping.

Availability And Implementation: The source code of the benchmark in this work is available in the GitHub repository https://github.com/xp3i4/mini_benchmark.
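To make the Motivation concrete, the following is a minimal Python sketch of the standard canonical minimizer and of the density measure, not of the refined operator proposed in the paper, which the abstract does not specify. The function names (`reverse_complement`, `canonical_minimizers`, `density`), the window parameter `w` (number of consecutive k-mers per window), the leftmost tie-breaking rule, and the definition of density as selected positions divided by total k-mers are all assumptions made for illustration. They are not taken from the authors' benchmark code.

```python
from typing import Callable, List, Tuple

# Translation table for complementing A/C/G/T (assumes an upper-case DNA alphabet).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Return the reverse complement of a DNA string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical_minimizers(
    seq: str,
    k: int,
    w: int,
    order: Callable[[str], object] = lambda kmer: kmer,
) -> List[Tuple[int, str]]:
    """Select canonical minimizers from seq.

    Each k-mer is first reduced to its canonical form: the smaller of the
    forward k-mer and its reverse complement under `order` (lexicographic
    by default). Then, in every window of w consecutive k-mers, the position
    of the smallest canonical k-mer is selected (leftmost on ties, which is
    an assumption of this sketch).
    """
    canon = []
    for i in range(len(seq) - k + 1):
        fwd = seq[i:i + k]
        canon.append(min(fwd, reverse_complement(fwd), key=order))

    selected = set()
    for start in range(len(canon) - w + 1):
        # min() returns the first minimal element, i.e. the leftmost position.
        best = min(range(start, start + w), key=lambda i: order(canon[i]))
        selected.add(best)

    return [(i, canon[i]) for i in sorted(selected)]


def density(seq: str, k: int, w: int,
            order: Callable[[str], object] = lambda kmer: kmer) -> float:
    """Fraction of k-mer positions that are selected as minimizers."""
    n_kmers = len(seq) - k + 1
    return len(canonical_minimizers(seq, k, w, order)) / n_kmers


if __name__ == "__main__":
    print(canonical_minimizers("ACGTTGCA", k=3, w=2))
    # [(0, 'ACG'), (2, 'AAC'), (3, 'CAA'), (4, 'GCA')]
```

Passing a different `order` function (for example, a hash of the k-mer) swaps the lexicographic order for another total order without changing the rest of the sketch.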

Citing Articles

When less is more: sketching with minimizers in genomics.

Ndiaye M, Prieto-Banos S, Fitzgerald L, Yazdizadeh Kharrazi A, Oreshkov S, Dessimoz C. Genome Biol. 2024; 25(1):270.

PMID: 39402664 PMC: 11472564. DOI: 10.1186/s13059-024-03414-4.

Leaf: an ultrafast filter for population-scale long-read SV detection.

Pan C, Reinert K. Genome Biol. 2024; 25(1):155.

PMID: 38872200 PMC: 11170821. DOI: 10.1186/s13059-024-03297-5.

References

Buchler T, Olbrich J, Ohlebusch E. Efficient short read mapping to a pangenome that is represented by a graph of ED strings. Bioinformatics. 2023; 39(5). PMC: 10232250. DOI: 10.1093/bioinformatics/btad320.

Li H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics. 2016; 32(14):2103-10. PMC: 4937194. DOI: 10.1093/bioinformatics/btw152.

Jain C, Rhie A, Zhang H, Chu C, Walenz B, Koren S. Weighted minimizer sampling improves long read mapping. Bioinformatics. 2020; 36(Suppl_1):i111-i118. PMC: 7355284. DOI: 10.1093/bioinformatics/btaa435.

Chikhi R, Limasset A, Medvedev P. Compacting de Bruijn graphs from sequencing data quickly and in low memory. Bioinformatics. 2016; 32(12):i201-i208. PMC: 4908363. DOI: 10.1093/bioinformatics/btw279.

Roberts M, Hayes W, Hunt B, Mount S, Yorke J. Reducing storage requirements for biological sequence comparison. Bioinformatics. 2004; 20(18):3363-9. DOI: 10.1093/bioinformatics/bth408.

Deorowicz S, Kokot M, Grabowski S, Debudaj-Grabysz A. KMC 2: fast and resource-frugal k-mer counting. Bioinformatics. 2015; 31(10):1569-76. DOI: 10.1093/bioinformatics/btv022.

Mohamadi H, Chu J, Vandervalk B, Birol I. ntHash: recursive nucleotide hashing. Bioinformatics. 2016; 32(22):3492-3494. PMC: 5181554. DOI: 10.1093/bioinformatics/btw397.

Wood D, Salzberg S. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 2014; 15(3):R46. PMC: 4053813. DOI: 10.1186/gb-2014-15-3-r46.

Sahlin K. Strobealign: flexible seed size enables ultra-fast and accurate read alignment. Genome Biol. 2022; 23(1):260. PMC: 9753264. DOI: 10.1186/s13059-022-02831-7.

Edgar R. Syncmers are more sensitive than minimizers for selecting conserved k-mers in biological sequences. PeerJ. 2021; 9:e10805. PMC: 7869670. DOI: 10.7717/peerj.10805.

Marcais G, Pellow D, Bork D, Orenstein Y, Shamir R, Kingsford C. Improving the performance of minimizers and winnowing schemes. Bioinformatics. 2017; 33(14):i110-i117. PMC: 5870760. DOI: 10.1093/bioinformatics/btx235.

Zheng H, Kingsford C, Marcais G. Sequence-specific minimizers via polar sets. Bioinformatics. 2021; 37(Suppl_1):i187-i195. PMC: 8686682. DOI: 10.1093/bioinformatics/btab313.

Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018; 34(18):3094-3100. PMC: 6137996. DOI: 10.1093/bioinformatics/bty191.