Efficient Computation of Spaced Seeds

Overview

Journal BMC Res Notes

Publisher Biomed Central

Specialties Biology; General Medicine

Date 2012 Mar 1

PMID 22373455

Citations 5

Authors

Silvana Ilie

Abstract

Background: The most frequently used tools in bioinformatics are those that search for similarities, or local alignments, between biological sequences. Since the exact dynamic programming algorithm is quadratic, linear-time heuristics such as BLAST are used. Spaced seeds are much more sensitive than the consecutive seed of BLAST, and using several seeds is the current state of the art in approximate search for biological sequences. The most important aspect is computing highly sensitive seeds. Since the problem seems hard, heuristic algorithms are used. The leading software in the common Bernoulli model is the SpEED program.
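To make the difference between a consecutive seed and a spaced seed concrete, here is a minimal sketch. It assumes the usual convention, which the abstract does not spell out, that a seed is a string of 1s (positions that must match) and 0s (positions that are ignored), and that the two sequences are already aligned without gaps. The function names `seed_weight` and `seed_hits` are illustrative and do not come from BLAST or SpEED.

```python
def seed_weight(seed: str) -> int:
    """Weight of a seed = number of 1 (must-match) positions."""
    return seed.count("1")


def seed_hits(seed: str, a: str, b: str) -> list[int]:
    """Return every start offset at which `seed` hits the ungapped
    alignment of `a` against `b` (assumed convention: all positions
    marked 1 must match; positions marked 0 are ignored)."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length for an ungapped alignment")
    care = [i for i, c in enumerate(seed) if c == "1"]
    hits = []
    for start in range(len(a) - len(seed) + 1):
        if all(a[start + i] == b[start + i] for i in care):
            hits.append(start)
    return hits


if __name__ == "__main__":
    a = "ACGTA"
    b = "ACTTA"  # single mismatch at index 2
    # Both seeds have weight 3, but only the spaced one tolerates the mismatch here.
    print(seed_hits("111", a, b))   # consecutive seed
    print(seed_hits("1101", a, b))  # spaced seed
```

In this toy alignment the consecutive seed of weight 3 finds no hit, while the spaced seed of the same weight hits at offset 0 because its 0 position falls on the mismatch. This is a single example, not a measurement of sensitivity.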

Findings: SpEED uses a hill-climbing method based on the overlap complexity heuristic. We propose a new algorithm for this heuristic that improves its speed by more than one order of magnitude. We use the new implementation to compute improved seeds for several software programs. We also compute multiple seeds, of the same weight as the seed used by MegaBLAST, that greatly improve its sensitivity.
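The abstract does not define overlap complexity, so the following is only a sketch under a stated assumption: for two seeds, slide one across the other, count the pairs of aligned 1s at each shift, and sum 2 raised to that count over all shifts; for a set of seeds, sum this over all pairs, including each seed with itself. Lower values are taken as the proxy for better seeds. The hill climb below is a generic swap-based search written for illustration. It is not SpEED's procedure and not the faster algorithm the paper proposes, and all function names are hypothetical.

```python
from itertools import combinations_with_replacement


def shared_ones(s1: str, s2: str, shift: int) -> int:
    """Number of positions where both seeds have a 1 when s2 is
    shifted by `shift` positions relative to s1."""
    count = 0
    for j, c in enumerate(s2):
        i = j + shift
        if c == "1" and 0 <= i < len(s1) and s1[i] == "1":
            count += 1
    return count


def overlap_complexity_pair(s1: str, s2: str) -> int:
    """Assumed definition: sum over all overlapping shifts of 2**(aligned 1s)."""
    return sum(2 ** shared_ones(s1, s2, shift)
               for shift in range(1 - len(s2), len(s1)))


def overlap_complexity(seeds: list[str]) -> int:
    """Sum of pairwise overlap complexities, each seed also paired with itself."""
    return sum(overlap_complexity_pair(a, b)
               for a, b in combinations_with_replacement(seeds, 2))


def hill_climb(seeds: list[str]) -> tuple[list[str], int]:
    """Naive hill climbing (illustrative only): swap an interior 1 with an
    interior 0 inside one seed and keep the swap if it lowers the score.
    Weights, lengths, and the first and last positions are preserved."""
    seeds = list(seeds)
    best = overlap_complexity(seeds)
    improved = True
    while improved:
        improved = False
        for k in range(len(seeds)):
            seed = seeds[k]
            for i in range(1, len(seed) - 1):
                for j in range(i + 1, len(seed) - 1):
                    if seed[i] == seed[j]:
                        continue  # swapping equal symbols changes nothing
                    chars = list(seed)
                    chars[i], chars[j] = chars[j], chars[i]
                    candidate = seeds[:k] + ["".join(chars)] + seeds[k + 1:]
                    score = overlap_complexity(candidate)
                    if score < best:
                        seeds, best, improved = candidate, score, True
                        seed = seeds[k]
    return seeds, best


if __name__ == "__main__":
    start = ["1111001", "1110011"]
    print(overlap_complexity(start))
    print(hill_climb(start))
```

Every candidate is rescored from scratch here, which is the simplest possible approach; it makes no attempt to reproduce the speed-up reported in the abstract.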

Conclusion: Multiple spaced seeds are used successfully in bioinformatics software. Enabling researchers to compute high-quality seeds very quickly will help expand the range of their applications.

Citing Articles

PerFSeeB: designing long high-weight single spaced seeds for full sensitivity alignment with a given number of mismatches.

Titarenko V, Titarenko S BMC Bioinformatics. 2023; 24(1):396.

PMID: 37875804 PMC: 10594774. DOI: 10.1186/s12859-023-05517-4.

A survey of mapping algorithms in the long-reads era.

Sahlin K, Baudeau T, Cazaux B, Marchet C Genome Biol. 2023; 24(1):133.

PMID: 37264447 PMC: 10236595. DOI: 10.1186/s13059-023-02972-3.

'Multi-SpaM': a maximum-likelihood approach to phylogeny reconstruction using multiple spaced-word matches and quartet trees.

Dencker T, Leimeister C, Gerth M, Bleidorn C, Snir S, Morgenstern B NAR Genom Bioinform. 2021; 2(1):lqz013.

PMID: 33575565 PMC: 7671388. DOI: 10.1093/nargab/lqz013.

To Petabytes and beyond: recent advances in probabilistic and signal processing algorithms and their application to metagenomics.

Elworth R, Wang Q, Kota P, Barberan C, Coleman B, Balaji A Nucleic Acids Res. 2020; 48(10):5217-5234.

PMID: 32338745 PMC: 7261164. DOI: 10.1093/nar/gkaa265.

Best hits of 11110110111: model-free selection and parameter-free sensitivity calculation of spaced seeds.

Noe L Algorithms Mol Biol. 2017; 12:1.

PMID: 28289437 PMC: 5310094. DOI: 10.1186/s13015-017-0092-1.
