
Predicting the Sequence Specificities of DNA- and RNA-binding Proteins by Deep Learning

Overview
Journal: Nat Biotechnol
Specialty: Biotechnology
Date: 2015 Jul 28
PMID: 26213851
Citations: 911
Abstract

Knowing the sequence specificities of DNA- and RNA-binding proteins is essential for developing models of the regulatory processes in biological systems and for identifying causal disease variants. Here we show that sequence specificities can be ascertained from experimental data with 'deep learning' techniques, which offer a scalable, flexible and unified computational approach for pattern discovery. Using a diverse array of experimental data and evaluation metrics, we find that deep learning outperforms other state-of-the-art methods, even when training on in vitro data and testing on in vivo data. We call this approach DeepBind and have built a stand-alone software tool that is fully automatic and handles millions of sequences per experiment. Specificities determined by DeepBind are readily visualized as a weighted ensemble of position weight matrices or as a 'mutation map' that indicates how variations affect binding within a specific sequence.
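The 'mutation map' idea can be illustrated with a minimal sketch. This is not the DeepBind model or its API: a hand-written, hypothetical 3-bp position weight matrix (`toy_pwm`) stands in for a trained scoring model, and the names `pwm_score` and `mutation_map` are assumptions made for illustration. The sketch scores a sequence, then records how the score changes when each position is substituted with each of the four bases.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (len, 4) one-hot array in A, C, G, T order."""
    idx = [BASES.index(b) for b in seq]
    out = np.zeros((len(seq), 4))
    out[np.arange(len(seq)), idx] = 1.0
    return out

def pwm_score(seq, pwm):
    """Best PWM score over all windows of seq (toy stand-in for a trained model)."""
    x = one_hot(seq)
    w = pwm.shape[0]
    return max(float(np.sum(x[i:i + w] * pwm)) for i in range(len(seq) - w + 1))

def mutation_map(seq, score_fn):
    """Return a (4, len) array: score change when position j is set to base b."""
    ref = score_fn(seq)
    delta = np.zeros((4, len(seq)))
    for j in range(len(seq)):
        for b, base in enumerate(BASES):
            mutant = seq[:j] + base + seq[j + 1:]
            delta[b, j] = score_fn(mutant) - ref
    return delta

# Hypothetical 3-bp PWM favouring "TGA" (columns are A, C, G, T).
toy_pwm = np.array([
    [-1.0, -1.0, -1.0,  1.0],  # motif position 1 prefers T
    [-1.0, -1.0,  1.0, -1.0],  # motif position 2 prefers G
    [ 1.0, -1.0, -1.0, -1.0],  # motif position 3 prefers A
])

seq = "CCTGACC"
delta = mutation_map(seq, lambda s: pwm_score(s, toy_pwm))
```

Rows of `delta` correspond to the substituted base and columns to sequence positions. In this toy setup, entries for the unmutated base are zero, and substitutions inside the `TGA` site lower the score, while substitutions in the flanks leave it unchanged.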

Citing Articles

RBPsuite 2.0: an updated RNA-protein binding site prediction suite with high coverage on species and proteins based on deep learning.

Pan X, Fang Y, Liu X, Guo X, Shen H. BMC Biol. 2025; 23(1):74.

PMID: 40069726 PMC: 11899677. DOI: 10.1186/s12915-025-02182-2.


Precise engineering of gene expression by editing plasticity.

Qiu Y, Liu L, Yan J, Xiang X, Wang S, Luo Y. Genome Biol. 2025; 26(1):51.

PMID: 40065399 PMC: 11892124. DOI: 10.1186/s13059-025-03516-7.


Pathways to chronic disease detection and prediction: Mapping the potential of machine learning to the pathophysiological processes while navigating ethical challenges.

Afrifa-Yamoah E, Adua E, Peprah-Yamoah E, Anto E, Opoku-Yamoah V, Acheampong E. Chronic Dis Transl Med. 2025; 11(1):1-21.

PMID: 40051825 PMC: 11880127. DOI: 10.1002/cdt3.137.


Enhancer reprogramming: critical roles in cancer and promising therapeutic strategies.

Yang J, Zhou F, Luo X, Fang Y, Wang X, Liu X. Cell Death Discov. 2025; 11(1):84.

PMID: 40032852 PMC: 11876437. DOI: 10.1038/s41420-025-02366-3.


Inferring protein from transcript abundances using convolutional neural networks.

Schwehn P, Falter-Braun P. BioData Min. 2025; 18(1):18.

PMID: 40016737 PMC: 11866710. DOI: 10.1186/s13040-025-00434-z.

