Pairwise Heuristic Sequence Alignment Algorithm Based on Deep Reinforcement Learning

Overview

Journal: IEEE Open J Eng Med Biol

Publisher: IEEE

Specialty: Biomedical Engineering

Date: 2022 Apr 11

PMID: 35402983

Authors

Yong-Joon Song

Dong Jin Ji

Hyein Seo

Gyu-Bum Han

Dong-Ho Cho

Abstract

Various methods have been developed to analyze the association between organisms and their genomic sequences. Among them, sequence alignment is the most frequently used method for comparative analysis of biological genomes. We propose a novel pairwise sequence alignment method that uses deep reinforcement learning to move beyond conventional pairwise alignment algorithms. We define the environment and the agent so that reinforcement learning can be applied to the sequence alignment system. The method, named DQNalign, determines the next alignment direction immediately by observing the subsequences within a moving window. DQNalign shows superior performance on dissimilar sequence pairs that have low identity values. We also confirm theoretically that the complexity of DQNalign is of low order with respect to the sequence length. This research shows how deep reinforcement learning can be applied to the sequence alignment system and how it can improve conventional sequence alignment methods.
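The window-based decision loop described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the names `observe`, `greedy_policy`, and `align`, the three-action set, the window size, and above all the hand-written greedy scoring rule, which stands in for the trained deep Q-network and will make suboptimal choices that a learned policy might avoid.

```python
from typing import Callable, List, Tuple

# Hypothetical action set: advance in both sequences, or open a gap in one.
FORWARD, GAP_IN_A, GAP_IN_B = 0, 1, 2


def observe(a: str, b: str, i: int, j: int, window: int) -> Tuple[str, str]:
    """Return the subsequences visible from positions (i, j), padded with '-'."""
    return a[i:i + window].ljust(window, "-"), b[j:j + window].ljust(window, "-")


def greedy_policy(obs: Tuple[str, str]) -> int:
    """Placeholder for a trained Q-network (assumption, not the paper's model).

    Scores each action by how many window positions would agree after taking it.
    """
    wa, wb = obs

    def agree(x: str, y: str) -> int:
        return sum(1 for p, q in zip(x, y) if p == q and p != "-")

    scores = [
        agree(wa, wb),      # FORWARD: pair a[i] with b[j]
        agree(wa, wb[1:]),  # GAP_IN_A: consume b[j] only
        agree(wa[1:], wb),  # GAP_IN_B: consume a[i] only
    ]
    # Ties resolve to the lowest index, so FORWARD is preferred.
    return max(range(3), key=scores.__getitem__)


def align(a: str, b: str,
          policy: Callable[[Tuple[str, str]], int] = greedy_policy,
          window: int = 8) -> Tuple[str, str]:
    """Walk both sequences once, asking the policy for a direction at each step."""
    i = j = 0
    out_a: List[str] = []
    out_b: List[str] = []
    while i < len(a) and j < len(b):
        action = policy(observe(a, b, i, j, window))
        if action == FORWARD:
            out_a.append(a[i]); out_b.append(b[j]); i += 1; j += 1
        elif action == GAP_IN_A:
            out_a.append("-"); out_b.append(b[j]); j += 1
        else:  # GAP_IN_B
            out_a.append(a[i]); out_b.append("-"); i += 1
    # One sequence is exhausted; pad the other's remainder with gaps.
    out_a.extend(a[i:]); out_b.extend("-" * (len(a) - i))
    out_a.extend("-" * (len(b) - j)); out_b.extend(b[j:])
    return "".join(out_a), "".join(out_b)


if __name__ == "__main__":
    top, bottom = align("ACGTTGCA", "ACGTGCA")
    print(top)
    print(bottom)
```

In this sketch every step advances at least one of the two positions, so the loop runs at most len(a) + len(b) times with a fixed amount of work per window. That is a property of the sketch itself, offered only as intuition for why a window-based agent can avoid filling a full dynamic-programming table.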

Citing Articles

Deep reinforcement learning-based pairwise DNA sequence alignment method compatible with embedded edge devices.

Lall A, Tallur S. Sci Rep. 2023; 13(1):2773.

PMID: 36797269 PMC: 9935504. DOI: 10.1038/s41598-023-29277-6.

learnMSA: learning and aligning large protein families.

Becker F, Stanke M. Gigascience. 2022; 11.

PMID: 36399060 PMC: 9673500. DOI: 10.1093/gigascience/giac104.

Heuristic Pairwise Alignment in Database Environments.

Liptak P, Kiss A, Szalai-Gindl J. Genes (Basel). 2022; 13(11).

PMID: 36360242 PMC: 9690874. DOI: 10.3390/genes13112005.

Local Alignment of DNA Sequence Based on Deep Reinforcement Learning.

Song Y, Cho D. IEEE Open J Eng Med Biol. 2022; 2:170-178.

PMID: 35402982 PMC: 8975175. DOI: 10.1109/OJEMB.2021.3076156.

References

Chao K, Pearson W, Miller W. Aligning two sequences within a specified diagonal band. Comput Appl Biosci. 1992; 8(5):481-7. DOI: 10.1093/bioinformatics/8.5.481.

Tang J, Hua K, Chen M, Zhang R, Xie X. A novel k-word relative measure for sequence comparison. Comput Biol Chem. 2014; 53PB:331-338. DOI: 10.1016/j.compbiolchem.2014.10.007.

Hayashi T, Makino K, Ohnishi M, Kurokawa K, Ishii K, Yokoyama K. Complete genome sequence of enterohemorrhagic Escherichia coli O157:H7 and genomic comparison with a laboratory strain K-12. DNA Res. 2001; 8(1):11-22. DOI: 10.1093/dnares/8.1.11.

Needleman S, Wunsch C. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970; 48(3):443-53. DOI: 10.1016/0022-2836(70)90057-4.

Wolfsheimer S, Burghardt B, Hartmann A. Local sequence alignments statistics: deviations from Gumbel statistics in the rare-event tail. Algorithms Mol Biol. 2007; 2:9. PMC: 1945026. DOI: 10.1186/1748-7188-2-9.

Chen X, Li X, Wang P, Liu Y, Zhang Z, Zhao G. Novel association strategy with copy number variation for identifying new risk Loci of human diseases. PLoS One. 2010; 5(8):e12185. PMC: 2924882. DOI: 10.1371/journal.pone.0012185.

Marcais G, Delcher A, Phillippy A, Coston R, Salzberg S, Zimin A. MUMmer4: A fast and versatile genome alignment system. PLoS Comput Biol. 2018; 14(1):e1005944. PMC: 5802927. DOI: 10.1371/journal.pcbi.1005944.

Schuster S. Next-generation sequencing transforms today's biology. Nat Methods. 2008; 5(1):16-8. DOI: 10.1038/nmeth1156.

Pang H, Tang J, Chen S, Tao S. Statistical distributions of optimal global alignment scores of random protein sequences. BMC Bioinformatics. 2005; 6:257. PMC: 1276786. DOI: 10.1186/1471-2105-6-257.

Mott R. Maximum-likelihood estimation of the statistical distribution of Smith-Waterman local sequence similarity scores. Bull Math Biol. 2015; 54(1):59-75. DOI: 10.1007/BF02458620.

Wang L, Jiang T. On the complexity of multiple sequence alignment. J Comput Biol. 1994; 1(4):337-48. DOI: 10.1089/cmb.1994.1.337.

Ford D, Easton D, Bishop D, Narod S, Goldgar D. Risks of cancer in BRCA1-mutation carriers. Breast Cancer Linkage Consortium. Lancet. 1994; 343(8899):692-5. DOI: 10.1016/s0140-6736(94)91578-4.

Sievers F, Wilm A, Dineen D, Gibson T, Karplus K, Li W. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol. 2011; 7:539. PMC: 3261699. DOI: 10.1038/msb.2011.75.

Katoh K, Standley D. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 2013; 30(4):772-80. PMC: 3603318. DOI: 10.1093/molbev/mst010.

Patki M, Chari V, Sivakumaran S, Gonit M, Trumbly R, Ratnam M. The ETS domain transcription factor ELK1 directs a critical component of growth signaling by the androgen receptor in prostate cancer cells. J Biol Chem. 2013; 288(16):11047-65. PMC: 3630885. DOI: 10.1074/jbc.M112.438473.

Qian B, Goldstein R. Distribution of Indel lengths. Proteins. 2001; 45(1):102-4. DOI: 10.1002/prot.1129.

Sarsani V, Raghupathy N, Fiddes I, Armstrong J, Thibaud-Nissen F, Zinder O. The Genome of C57BL/6J "Eve", the Mother of the Laboratory Mouse Genome Reference Strain. G3 (Bethesda). 2019; 9(6):1795-1805. PMC: 6553538. DOI: 10.1534/g3.119.400071.

Tutaj M, Smith J, Bolton E. Rat Genome Assemblies, Annotation, and Variant Repository. Methods Mol Biol. 2019; 2018:43-70. DOI: 10.1007/978-1-4939-9581-3_2.

Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K. BLAST+: architecture and applications. BMC Bioinformatics. 2009; 10:421. PMC: 2803857. DOI: 10.1186/1471-2105-10-421.

Edgar R. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004; 32(5):1792-7. PMC: 390337. DOI: 10.1093/nar/gkh340.