EvoLSTM: Context-dependent Models of Sequence Evolution Using a Sequence-to-sequence LSTM

Overview
Journal Bioinformatics
Specialty Biology
Date 2020 Jul 14
PMID 32657367
Citations 3
Abstract

Motivation: Accurate probabilistic models of sequence evolution are essential for a wide variety of bioinformatics tasks, including sequence alignment and phylogenetic inference. The ability to realistically simulate sequence evolution is also at the core of many benchmarking strategies. Yet, mutational processes have complex context dependencies that remain poorly modeled and understood.

Results: We introduce EvoLSTM, a recurrent neural network-based evolution simulator that captures mutational context dependencies. EvoLSTM uses a sequence-to-sequence long short-term memory model trained to predict mutation probabilities at each position of a given sequence, taking into consideration the 14 flanking nucleotides. EvoLSTM can realistically simulate mammalian and plant DNA sequence evolution and reveals unexpectedly strong long-range context dependencies in mutation probabilities. EvoLSTM brings modern machine-learning approaches to bear on sequence evolution. It will serve as a useful tool to study and simulate complex mutational processes.
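
The idea of position-wise, context-dependent mutation probabilities can be illustrated with a small sketch. This is not the EvoLSTM model: the trained sequence-to-sequence LSTM is replaced by a hypothetical hand-written function, `toy_mutation_probs`, that only raises the substitution rate at CpG sites. The split of the 14 flanking nucleotides into 7 on each side, the `N` padding at the sequence ends, and the rate values are all assumptions made for illustration.

```python
import random

ALPHABET = "ACGT"
FLANK = 7   # assumption: 14 flanking nucleotides taken as 7 on each side
PAD = "N"   # assumption: sequence ends are padded with an unknown base


def context_windows(seq, flank=FLANK):
    """Yield, for every position, the site together with its flanking bases."""
    padded = PAD * flank + seq + PAD * flank
    for i in range(len(seq)):
        yield padded[i:i + 2 * flank + 1]


def toy_mutation_probs(window, base_rate=0.01, cpg_rate=0.10):
    """Hypothetical stand-in for a trained model.

    Returns P(descendant base | context window). Only the immediate
    neighbours are inspected here; a learned model could use the whole window.
    Expects an odd-length window with at least one flanking base per side.
    """
    center = len(window) // 2
    base = window[center]
    in_cpg = (base == "C" and window[center + 1] == "G") or (
        base == "G" and window[center - 1] == "C"
    )
    rate = cpg_rate if in_cpg else base_rate
    probs = {b: rate / 3 for b in ALPHABET if b != base}
    probs[base] = 1.0 - rate
    return probs


def simulate(seq, rng, prob_fn=toy_mutation_probs):
    """Sample a descendant sequence one position at a time (substitutions only)."""
    out = []
    for window in context_windows(seq):
        probs = prob_fn(window)
        bases = list(probs)
        weights = [probs[b] for b in bases]
        out.append(rng.choices(bases, weights=weights)[0])
    return "".join(out)


if __name__ == "__main__":
    ancestor = "ACGTTGCACGGATCCGTA"
    print(simulate(ancestor, random.Random(0)))
```

Passing a different `prob_fn` is where a learned model would plug in; the sampling loop itself does not depend on how the probabilities are produced. Insertions and deletions are deliberately left out of this sketch.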

Availability and Implementation: Code and dataset are available at https://github.com/DongjoonLim/EvoLSTM.

Supplementary Information: Supplementary data are available at Bioinformatics online.

Citing Articles

Insertions and Deletions: Computational Methods, Evolutionary Dynamics, and Biological Applications.

Redelings B, Holmes I, Lunter G, Pupko T, Anisimova M Mol Biol Evol. 2024; 41(9).

PMID: 39172750 PMC: 11385596. DOI: 10.1093/molbev/msae177.


Context-Dependent Substitution Dynamics in Plastid DNA Across a Wide Range of Taxonomic Groups.

Morton B J Mol Evol. 2022; 90(1):44-55.

PMID: 35037071 DOI: 10.1007/s00239-021-10040-2.


Attention-Based Deep Multiple-Instance Learning for Classifying Circular RNA and Other Long Non-Coding RNA.

Liu Y, Fu Q, Peng X, Zhu C, Liu G, Liu L Genes (Basel). 2021; 12(12).

PMID: 34946967 PMC: 8701965. DOI: 10.3390/genes12122018.
