
Effective Sequence Similarity Detection with Strobemers

Overview
Journal Genome Res
Specialty Genetics
Date 2021 Oct 20
PMID 34667119
Citations 33
Abstract

k-mer-based methods are widely used in bioinformatics for various types of sequence comparisons. However, a single mutation will mutate k consecutive k-mers, which makes most k-mer-based applications for sequence comparison sensitive to variable mutation rates. Many techniques have been studied to overcome this sensitivity, for example, spaced k-mers and k-mer permutation techniques, but these techniques do not handle indels well. For indels, pairs or groups of small k-mers are commonly used, but these methods first produce k-mer matches, and only in a second step is a pairing or grouping of k-mers performed. Such techniques produce many redundant k-mer matches owing to the size of k. Here, we propose strobemers as an alternative to k-mers for sequence comparison. Intuitively, strobemers consist of two or more linked shorter k-mers, where the combination of linked k-mers is decided by a hash function. We use simulated data to show that strobemers provide more evenly distributed sequence matches and are less sensitive to different mutation rates than k-mers and spaced k-mers. Strobemers also produce higher match coverage across sequences. We further implement a proof-of-concept sequence-matching tool, StrobeMap, and use synthetic and biological Oxford Nanopore sequencing data to show the utility of using strobemers for sequence comparison in different contexts such as sequence clustering and alignment scenarios.
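
The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the StrobeMap implementation: the function names, the window parameters `w_min` and `w_max`, the BLAKE2 hash, and the "smallest combined hash" linking rule are all assumptions chosen for illustration. The first part counts how many k-mers a single substitution destroys; the second builds order-2 strobemers by linking each k-mer to one downstream k-mer selected by a hash function.

```python
import hashlib

def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order of start position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def stable_hash(s):
    """Deterministic 64-bit hash (the built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8).digest(), "big")

def randstrobes(seq, k, w_min, w_max):
    """Order-2 strobemers (illustrative linking rule, an assumption).

    The first strobe is the k-mer at position i. The second strobe is the
    k-mer at a position j in the window [i + w_min, i + w_max] whose hash,
    added to the hash of the first strobe, is smallest modulo 2**64.
    Returns a list of (i, j, linked_sequence) tuples.
    """
    n = len(seq) - k + 1          # number of k-mer start positions
    out = []
    for i in range(n):
        lo = i + w_min
        hi = min(i + w_max, n - 1)
        if lo > hi:               # no room left for a second strobe
            break
        h1 = stable_hash(seq[i:i + k])
        j = min(range(lo, hi + 1),
                key=lambda p: (h1 + stable_hash(seq[p:p + k])) % 2**64)
        out.append((i, j, seq[i:i + k] + seq[j:j + k]))
    return out

# Hypothetical 30-nt example with one substitution at position 15.
ref = "ACGTTGCAAGGCTTACGGATCCATGACTGA"
mut = ref[:15] + "T" + ref[16:]

# A substitution in the interior of the sequence changes k consecutive k-mers.
k = 5
changed = sum(a != b for a, b in zip(kmers(ref, k), kmers(mut, k)))
print(f"k-mers changed by one substitution (k={k}): {changed}")

# Strobemers built from two shorter strobes; some may still match across
# the mutation because the strobes need not be contiguous.
ref_strobes = randstrobes(ref, k=3, w_min=4, w_max=8)
mut_strobes = randstrobes(mut, k=3, w_min=4, w_max=8)
shared = set(ref_strobes) & set(mut_strobes)
print(f"strobemers in ref: {len(ref_strobes)}, shared with mutant: {len(shared)}")
```

Because the second strobe is chosen by hash value rather than by a fixed offset, the spacing between linked k-mers varies from position to position, which is the property the abstract credits for more evenly distributed matches.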

Citing Articles

Sequence similarity estimation by random subsequence sketching.

Chen K, Pattar V, Shao M. bioRxiv. 2025.

PMID: 39975056 PMC: 11839126. DOI: 10.1101/2025.02.05.636706.


NEAR: Neural Embeddings for Amino acid Relationships.

Olson D, Colligan T, Demekas D, Roddy J, Youens-Clark K, Wheeler T. bioRxiv. 2025.

PMID: 39896534 PMC: 11785008. DOI: 10.1101/2024.01.25.577287.


Taming large-scale genomic analyses via sparsified genomics.

Alser M, Eudine J, Mutlu O. Nat Commun. 2025; 16(1):876.

PMID: 39837860 PMC: 11751491. DOI: 10.1038/s41467-024-55762-1.


Alevin-fry-atac enables rapid and memory frugal mapping of single-cell ATAC-seq data using virtual colors for accurate genomic pseudoalignment.

Singh N, Khan J, Patro R. bioRxiv. 2024.

PMID: 39677745 PMC: 11642815. DOI: 10.1101/2024.11.27.625771.


When less is more: sketching with minimizers in genomics.

Ndiaye M, Prieto-Banos S, Fitzgerald L, Yazdizadeh Kharrazi A, Oreshkov S, Dessimoz C. Genome Biol. 2024; 25(1):270.

PMID: 39402664 PMC: 11472564. DOI: 10.1186/s13059-024-03414-4.

