Murasaki: a Fast, Parallelizable Algorithm to Find Anchors from Multiple Genomes

Overview

Journal PLoS One

Specialties General Medicine
Science

Date 2010 Oct 2

PMID 20885980

Citations 17

Authors

Kris Popendorf

Tsuyoshi Hachiya

Yasunori Osana

Yasubumi Sakakibara

Abstract

Background: With the number of available genome sequences increasing rapidly, the magnitude of sequence data required for multiple-genome analyses is a challenging problem. When large-scale rearrangements break the collinearity of gene orders among genomes, genome comparison algorithms must first identify sets of short well-conserved sequences present in each genome, termed anchors. Previously, anchor identification among multiple genomes has been achieved using pairwise alignment tools like BLASTZ through progressive alignment tools like TBA, but the computational requirements for sequence comparisons of multiple genomes quickly become a limiting factor as the number and scale of genomes grow.

Methodology/Principal Findings: Our algorithm, named Murasaki, makes it possible to identify anchors within multiple large sequences on the scale of several hundred megabases in a few minutes using a single CPU. Two advanced features of Murasaki are (1) adaptive hash function generation, which enables efficient use of arbitrary mismatch patterns (spaced seeds) and therefore the comparison of multiple mammalian genomes in a practical amount of computation time, and (2) parallelizable execution that decreases the required wall-clock and CPU times. Murasaki can perform a sensitive anchoring of eight mammalian genomes (human, chimp, rhesus, orangutan, mouse, rat, dog, and cow) in 21 hours of CPU time (42 minutes of wall time). This is the first single-pass in-core anchoring of multiple mammalian genomes. We evaluated Murasaki by comparing it with the genome alignment programs BLASTZ and TBA. We show that Murasaki can anchor multiple genomes in near-linear time, compared to the quadratic time requirements of BLASTZ and TBA, while improving overall accuracy.
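The idea of anchoring with a spaced seed can be illustrated with a small sketch. This is not Murasaki's implementation: it uses an ordinary Python dictionary in place of the adaptive hash functions described above, ignores reverse-complement strands, and does not extend seed matches into full anchors. The function names (seed_key, find_seed_matches) and the pattern "1101" are hypothetical, chosen only for illustration. In the pattern, "1" marks a position that must match and "0" marks a position where a mismatch is tolerated.

```python
from collections import defaultdict


def seed_key(seq, start, pattern):
    """Return the bases at the '1' positions of the window, or None if any is not A/C/G/T."""
    window = seq[start:start + len(pattern)]
    key = "".join(base for base, flag in zip(window, pattern) if flag == "1")
    return key if set(key) <= set("ACGT") else None


def find_seed_matches(sequences, pattern):
    """Index every full-length window by its spaced-seed key.

    Returns {key: {sequence_index: [start positions]}} restricted to keys
    that occur in every input sequence (candidate anchor seeds).
    """
    index = defaultdict(lambda: defaultdict(list))
    for seq_id, seq in enumerate(sequences):
        for start in range(len(seq) - len(pattern) + 1):
            key = seed_key(seq, start, pattern)
            if key is not None:
                index[key][seq_id].append(start)
    # Keep only keys that were seen in all sequences.
    return {
        key: dict(hits)
        for key, hits in index.items()
        if len(hits) == len(sequences)
    }


if __name__ == "__main__":
    # Toy sequences: the windows ACGT, ACAT and ACTT differ only at the
    # third base, which the pattern "1101" ignores.
    toy = ["GGACGTGG", "TTACATTT", "CCACTTCC"]
    for key, hits in find_seed_matches(toy, "1101").items():
        print(key, hits)
```

Because each sequence is scanned once and every window costs one dictionary operation, this sketch does work proportional to the total input length for the indexing step, which is the intuition behind the near-linear scaling reported above. A real tool would also need to handle highly repeated seeds and both strands.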

Conclusions/Significance: Murasaki provides an open-source platform to take advantage of long patterns, cluster computing, and novel hash algorithms to produce accurate anchors across multiple genomes with computational efficiency significantly greater than existing methods. Murasaki is available under GPL at http://murasaki.sourceforge.net.

Citing Articles

Comparative genomics of , a fast-growing pathogen of wild .

Baby V, Ambroset C, Gaurivaud P, Falquet L, Boury C, Guichoux E Microb Genom. 2023; 9(10).

PMID: 37823548 PMC: 10634449. DOI: 10.1099/mgen.0.001112.

Hamster PIWI proteins bind to piRNAs with stage-specific size variations during oocyte maturation.

Ishino K, Hasuwa H, Yoshimura J, Iwasaki Y, Nishihara H, Seki N Nucleic Acids Res. 2021; 49(5):2700-2720.

PMID: 33590099 PMC: 7969018. DOI: 10.1093/nar/gkab059.

Genomic Characteristics of the Toxic Bloom-Forming Cyanobacterium NIES-102.

Yamaguchi H, Suzuki S, Osana Y, Kawachi M J Genomics. 2020; 8:1-6.

PMID: 31892993 PMC: 6930136. DOI: 10.7150/jgen.40978.

Inferring the Minimal Genome of by Comparative Genomics and Transposon Mutagenesis.

Baby V, Lachance J, Gagnon J, Lucier J, Matteau D, Knight T mSystems. 2018; 3(3).

PMID: 29657968 PMC: 5893858. DOI: 10.1128/mSystems.00198-17.

Complete Genome Sequence of NIES-2481 and Common Genomic Features of Group G .

Yamaguchi H, Suzuki S, Osana Y, Kawachi M J Genomics. 2018; 6:30-33.

PMID: 29576807 PMC: 5865083. DOI: 10.7150/jgen.24935.
