High Performance Multiple Sequence Alignment System for Pyrosequencing Reads from Multiple Reference Genomes

Overview

Journal J Parallel Distrib Comput

Date 2012 Nov 6

PMID 23125479

Citations 3

Authors

Fahad Saeed

Alan Perez-Rathke

Jaroslaw Gwarnicki

Tanya Berger-Wolf

Ashfaq Khokhar

Abstract

Genome resequencing with short reads generated from pyrosequencing generally relies on mapping the short reads against a single reference genome. However, mapping reads from multiple reference genomes is not possible with a pairwise mapping algorithm. Existing multiple sequence alignment (MSA) methods cannot be used to align the reads with respect to each other and to the reference genomes, because they do not take into account the position of the short reads with respect to the genome and are highly inefficient for large numbers of sequences. In this paper, we develop a highly scalable parallel algorithm based on domain decomposition, referred to as P-Pyro-Align, to align such large numbers of reads from single or multiple reference genomes. The proposed alignment algorithm accurately aligns the erroneous reads and has been implemented on a cluster of workstations using the MPI library. Experimental results for different problem sizes are analyzed in terms of execution time, quality of the alignments, and the ability of the algorithm to handle reads from multiple haplotypes. We report high-quality multiple alignments of up to 0.5 million reads. The algorithm is shown to be highly scalable and exhibits super-linear speedups with an increasing number of processors.
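The abstract names domain decomposition as the basis of the parallel algorithm but does not spell out the scheme. The following is a minimal sketch of one way such a decomposition could look, assuming each read already carries an approximate start position on the reference. The names `decompose_by_position` and `layout_region` are hypothetical and are not taken from the paper, and the per-region step is a simple positional layout standing in for the actual alignment procedure.

```python
from typing import Dict, List, Tuple

# A read is modeled as (approximate start position on the reference, sequence).
# This representation is an assumption made for illustration only.
Read = Tuple[int, str]


def decompose_by_position(
    reads: List[Read], genome_length: int, num_workers: int
) -> Dict[int, List[Read]]:
    """Assign each read to a worker by the genome region its start falls in."""
    if genome_length < 1 or num_workers < 1:
        raise ValueError("genome_length and num_workers must be positive")
    region = -(-genome_length // num_workers)  # ceiling division
    parts: Dict[int, List[Read]] = {w: [] for w in range(num_workers)}
    for pos, seq in reads:
        # Clamp so out-of-range positions still land on a valid worker.
        worker = min(max(pos, 0) // region, num_workers - 1)
        parts[worker].append((pos, seq))
    for assigned in parts.values():
        assigned.sort(key=lambda read: read[0])
    return parts


def layout_region(reads: List[Read]) -> List[str]:
    """Stand-in for the per-region alignment: pad reads to a common width."""
    if not reads:
        return []
    start = min(pos for pos, _ in reads)
    end = max(pos + len(seq) for pos, seq in reads)
    return [
        "-" * (pos - start) + seq + "-" * (end - pos - len(seq))
        for pos, seq in reads
    ]


if __name__ == "__main__":
    demo = [(0, "ACGT"), (2, "GTAC"), (10, "TTGA"), (12, "GACC")]
    for worker, assigned in decompose_by_position(demo, 16, 2).items():
        print(worker, layout_region(assigned))
```

In an MPI setting, each entry of the returned dictionary would be handed to a separate process so that regions are processed independently. That distribution step is omitted here to keep the sketch free of external dependencies.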

Citing Articles

Methods for Proteogenomics Data Analysis, Challenges, and Scalability Bottlenecks: A Survey.

Usman Tariq M, Haseeb M, Aledhari M, Razzak R, Parizi R, Saeed F. IEEE Access. 2021; 9:5497-5516.

PMID: 33537181 PMC: 7853650. DOI: 10.1109/ACCESS.2020.3047588.

Fast-GPU-PCC: A GPU-Based Technique to Compute Pairwise Pearson's Correlation Coefficients for Time Series Data-fMRI Study.

Eslami T, Saeed F. High Throughput. 2018; 7(2).

PMID: 29677161 PMC: 6023306. DOI: 10.3390/ht7020011.

PhosSA: Fast and accurate phosphorylation site assignment algorithm for mass spectrometry data.

Saeed F, Pisitkun T, Hoffert J, Rashidian S, Wang G, Gucek M. Proteome Sci. 2014; 11(Suppl 1):S14.

PMID: 24565028 PMC: 3909108. DOI: 10.1186/1477-5956-11-S1-S14.
