
Efficient Large-scale Sequence Comparison by Locality-sensitive Hashing

Overview
Journal: Bioinformatics
Specialty: Biology
Date: 2001 May 2
PMID: 11331236
Citations: 30
Abstract

Motivation: Comparison of multimegabase genomic DNA sequences is a popular technique for finding and annotating conserved genome features. Performing such comparisons entails finding many short local alignments between sequences up to tens of megabases in length. To process such long sequences efficiently, existing algorithms find alignments by expanding around short runs of matching bases with no substitutions or other differences. Unfortunately, exact matches that are short enough to occur often in significant alignments also occur frequently by chance in the background sequence. Thus, these algorithms must trade off between efficiency and sensitivity to features without long exact matches.
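To make the trade-off concrete, the following is a minimal sketch of the exact-match seeding step that such algorithms rely on. It is not taken from any particular tool; the function name `exact_seeds` and the parameter `k` are hypothetical, and a real aligner would go on to extend each seed into an alignment.

```python
from collections import defaultdict


def exact_seeds(a, b, k):
    """Return sorted (i, j) pairs such that a[i:i+k] == b[j:j+k].

    Illustrative only: indexes every k-mer of `a`, then looks up
    every k-mer of `b`. Names and parameters are assumptions.
    """
    index = defaultdict(list)
    for i in range(len(a) - k + 1):
        index[a[i:i + k]].append(i)

    hits = []
    for j in range(len(b) - k + 1):
        # .get avoids inserting empty entries into the index
        for i in index.get(b[j:j + k], ()):
            hits.append((i, j))
    return sorted(hits)


# Short seeds find the shared region (and would also match by chance
# in long background sequence).
short_hits = exact_seeds("ACGTACGT", "TTACGTTT", 4)

# A single substitution in the middle of an 8-base similarity leaves
# no exact 5-base run, so a longer seed misses it entirely.
long_hits = exact_seeds("ACGTACGT", "ACGAACGT", 5)
```

Raising `k` cuts down chance matches but loses similarities whose exact runs are shorter than `k`, which is the efficiency versus sensitivity tension described above.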

Results: We introduce a new algorithm, LSH-ALL-PAIRS, to find ungapped local alignments in genomic sequence with up to a specified fraction of substitutions. The length and substitution rate of these alignments can be chosen so that they appear frequently in significant similarities yet still remain rare in the background sequence. The algorithm finds ungapped alignments efficiently using a randomized search technique, locality-sensitive hashing. We have found LSH-ALL-PAIRS to be both efficient and sensitive for finding local similarities with as little as 63% identity in mammalian genomic sequences up to tens of megabases in length.
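The abstract does not give the details of LSH-ALL-PAIRS, so the following is only a generic sketch of locality-sensitive hashing for substitution (Hamming) distance, using the standard position-sampling scheme. All names and parameters here (`lsh_similar_pairs`, `sample_size`, `rounds`, `seed`) are assumptions for illustration, not the authors' implementation. Each round hashes every fixed-length window by a random subset of its positions; windows that land in the same bucket are then checked directly.

```python
import random
from collections import defaultdict
from itertools import combinations


def hamming(x, y):
    """Number of positions at which two equal-length strings differ."""
    return sum(c1 != c2 for c1, c2 in zip(x, y))


def lsh_similar_pairs(seq, length, max_subs, sample_size, rounds, seed=0):
    """Find pairs of window starts (i < j) whose windows of `length`
    bases differ by at most `max_subs` substitutions.

    Randomized: a qualifying pair is reported only if it collides in
    at least one round, so more rounds give higher sensitivity.
    """
    rng = random.Random(seed)
    n_windows = len(seq) - length + 1
    found = set()
    for _ in range(rounds):
        # Hash function for this round: a random subset of positions.
        positions = sorted(rng.sample(range(length), sample_size))
        buckets = defaultdict(list)
        for start in range(n_windows):
            window = seq[start:start + length]
            key = "".join(window[p] for p in positions)
            buckets[key].append(start)
        # Verify every colliding pair against the true distance.
        for starts in buckets.values():
            for i, j in combinations(starts, 2):
                if (i, j) in found:
                    continue
                if hamming(seq[i:i + length], seq[j:j + length]) <= max_subs:
                    found.add((i, j))
    return found


# Two copies of an 8-base word separated by a spacer.
example = "ACGTTGCA" + "GG" + "ACGTTGCA"
pairs = lsh_similar_pairs(example, length=8, max_subs=2,
                          sample_size=4, rounds=5)
```

In this scheme, two windows that differ at d of l positions collide in a single round with probability roughly (1 - d/l)^k for a sample of k positions, so similar windows collide far more often than unrelated ones. Identical windows always collide, and the verification step ensures no reported pair exceeds `max_subs`. A practical tool would also need to handle overlapping windows and low-complexity sequence, which this sketch ignores.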
