QuickProbs 2: Towards Rapid Construction of High-quality Alignments of Large Protein Families

Overview

Journal Sci Rep

Specialty Science

Date 2017 Feb 1

PMID 28139687

Citations 6

Authors

Adam Gudys

Sebastian Deorowicz


Abstract

The ever-increasing size of sequence databases, caused by the development of high-throughput sequencing, poses one of the greatest challenges yet to multiple alignment algorithms. As we show, well-established techniques employed for increasing alignment quality, i.e., refinement and consistency, are ineffective when large protein families are investigated. We present QuickProbs 2, an algorithm for multiple sequence alignment. Based on probabilistic models and equipped with novel column-oriented refinement and selective consistency, it offers outstanding accuracy. When analysing hundreds of sequences, QuickProbs 2 is noticeably better than ClustalΩ and MAFFT, the previous leaders for processing numerous protein families. In the case of smaller sets, for which consistency-based methods are the best performing, QuickProbs 2 is also superior to the competitors. Due to the low computational requirements of selective consistency and the utilization of massively parallel architectures, the presented algorithm has execution times similar to those of ClustalΩ and is orders of magnitude faster than full-consistency approaches such as MSAProbs or PicXAA. All this makes QuickProbs 2 an excellent tool for aligning families ranging from a few to hundreds of proteins.

Citing Articles

DNA binding and RAD51 engagement by the BRCA2 C-terminus orchestrate DNA repair and replication fork preservation.

Kwon Y, Rosner H, Zhao W, Selemenakis P, He Z, Kawale A Nat Commun. 2023; 14(1):432.

PMID: 36702902 PMC: 9879961. DOI: 10.1038/s41467-023-36211-x.

Spotlight on alternative frame coding: Two long overlapping genes in are translated and under purifying selection.

Kreitmeier M, Ardern Z, Abele M, Ludwig C, Scherer S, Neuhaus K iScience. 2022; 25(2):103844.

PMID: 35198897 PMC: 8850804. DOI: 10.1016/j.isci.2022.103844.

Ecological diversification reveals routes of pathogen emergence in endemic populations.

Lopez-Perez M, Jayakumar J, Grant T, Zaragoza-Solas A, Cabello-Yeves P, Almagro-Moreno S Proc Natl Acad Sci U S A. 2021; 118(40).

PMID: 34593634 PMC: 8501797. DOI: 10.1073/pnas.2103470118.

RaFAH: Host prediction for viruses of Bacteria and Archaea based on protein content.

Coutinho F, Zaragoza-Solas A, Lopez-Perez M, Barylski J, Zielezinski A, Dutilh B Patterns (N Y). 2021; 2(7):100274.

PMID: 34286299 PMC: 8276007. DOI: 10.1016/j.patter.2021.100274.

Parallelization of MAFFT for large-scale multiple sequence alignments.

Nakamura T, Yamada K, Tomii K, Katoh K Bioinformatics. 2018; 34(14):2490-2492.

PMID: 29506019 PMC: 6041967. DOI: 10.1093/bioinformatics/bty121.
