
DNABERT: Pre-trained Bidirectional Encoder Representations from Transformers Model for DNA-language in Genome

Overview
Journal Bioinformatics
Specialty Biology
Date 2021 Feb 4
PMID 33538820
Citations 213
Abstract

Motivation: Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. The gene regulatory code is highly complex due to polysemy and distant semantic relationships, which previous informatics methods often fail to capture, especially in data-scarce scenarios.

Results: To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, to capture a global and transferable understanding of genomic DNA sequences based on upstream and downstream nucleotide contexts. We compared DNABERT to the most widely used programs for genome-wide regulatory element prediction and demonstrate its ease of use, accuracy, and efficiency. We show that a single pre-trained transformer model can simultaneously achieve state-of-the-art performance on prediction of promoters, splice sites, and transcription factor binding sites after straightforward fine-tuning with small task-specific labeled datasets. Further, DNABERT enables direct visualization of nucleotide-level importance and semantic relationships within input sequences for better interpretability and accurate identification of conserved sequence motifs and functional genetic variant candidates. Finally, we demonstrate that DNABERT pre-trained on the human genome can be readily applied to other organisms with exceptional performance. We anticipate that the pre-trained DNABERT model can be fine-tuned for many other sequence analysis tasks.
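Before a DNA sequence reaches the transformer, DNABERT tokenizes it into overlapping k-mers via a sliding window, so each token carries local context from its neighbors. A minimal sketch of that tokenization step (the function name and defaults here are illustrative, not taken from the DNABERT codebase):

```python
def kmer_tokenize(seq: str, k: int = 6) -> list[str]:
    """Split a DNA sequence into overlapping k-mer tokens.

    Uses a sliding window of width k with stride 1, so a sequence of
    length n yields n - k + 1 tokens (DNABERT's pre-trained models use
    k in {3, 4, 5, 6}).
    """
    seq = seq.upper()
    if len(seq) < k:
        return []
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


print(kmer_tokenize("ATGCGTAC", k=6))  # → ['ATGCGT', 'TGCGTA', 'GCGTAC']
```

The resulting k-mer strings are then mapped to vocabulary IDs and fed to the BERT-style encoder; fine-tuning for a downstream task (e.g. promoter prediction) only replaces the classification head on top of this shared representation.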

Availability And Implementation: The source code and the pre-trained and fine-tuned models for DNABERT are available at GitHub (https://github.com/jerryji1993/DNABERT).

Supplementary Information: Supplementary data are available at Bioinformatics online.

Citing Articles

Predicting differentially methylated cytosines in TET and DNMT3 knockout mutants via a large language model.

Sereshki S, Lonardi S. Brief Bioinform. 2025; 26(2).

PMID: 40079264 PMC: 11904404. DOI: 10.1093/bib/bbaf092.


Foundation models in bioinformatics.

Guo F, Guan R, Li Y, Liu Q, Wang X, Yang C. Natl Sci Rev. 2025; 12(4):nwaf028.

PMID: 40078374 PMC: 11900445. DOI: 10.1093/nsr/nwaf028.


A Feature Engineering Method for Whole-Genome DNA Sequence with Nucleotide Resolution.

Wang T, Cui Y, Sun T, Li H, Wang C, Hou Y. Int J Mol Sci. 2025; 26(5).

PMID: 40076901 PMC: 11899767. DOI: 10.3390/ijms26052281.


Arabidopsis research in 2030: Translating the computable plant.

Brady S, Auge G, Ayalew M, Balasubramanian S, Hamann T, Inze D. Plant J. 2025; 121(5):e70047.

PMID: 40028766 PMC: 11874203. DOI: 10.1111/tpj.70047.


Decoding the effects of mutation on protein interactions using machine learning.

Xu W, Li A, Zhao Y, Peng Y. Biophys Rev (Melville). 2025; 6(1):011307.

PMID: 40013003 PMC: 11857871. DOI: 10.1063/5.0249920.

