Multiple Sequence Alignment-based RNA Language Model and Its Application to Structural Inference

Overview

Journal: Nucleic Acids Res

Publisher: Oxford University Press

Specialty: Biochemistry

Date: 2023 Nov 9

PMID: 37941140

Authors

Yikun Zhang

Mei Lang

Jiuhong Jiang

Zhiqiang Gao

Fan Xu

Thomas Litfin

Ke Chen

Jaswinder Singh

Xiansong Huang

Guoli Song

Yonghong Tian

Jian Zhan

Jie Chen

Yaoqi Zhou


Abstract

Compared with proteins, DNA and RNA are more difficult languages to interpret because four-letter coded DNA/RNA sequences have less information content than 20-letter coded protein sequences. While BERT (Bidirectional Encoder Representations from Transformers)-like language models have been developed for RNA, they are ineffective at capturing the evolutionary information from homologous sequences because RNA sequences are less conserved than protein sequences. Here, we have developed an unsupervised multiple sequence alignment-based RNA language model (RNA-MSM) by utilizing homologous sequences from an automatic pipeline, RNAcmap, which can provide significantly more homologous sequences than the manually annotated Rfam. We demonstrate that the resulting unsupervised, two-dimensional attention maps and one-dimensional embeddings from RNA-MSM contain structural information. In fact, they can be directly mapped with high accuracy to 2D base pairing probabilities and 1D solvent accessibilities, respectively. Further fine-tuning led to significantly improved performance on these two downstream tasks compared with existing state-of-the-art techniques including SPOT-RNA2 and RNAsnap2. By comparison, the embedding from RNA-FM, a BERT-based RNA language model, performs worse than one-hot encoding in base pair and solvent-accessible surface area prediction. We anticipate that the pre-trained RNA-MSM model can be fine-tuned on many other tasks related to RNA structure and function.
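The abstract states that attention maps can be mapped directly to 2D base pairing probabilities but does not say how. As a rough illustration only, the sketch below assumes a recipe common in protein contact prediction: symmetrize each attention map, apply an average product correction (APC), and combine the maps with a logistic regression. The function names, the weights, and the bias are hypothetical placeholders and are not taken from RNA-MSM.

```python
import math

def symmetrize(a):
    """Return a + a^T so that position pairs (i, j) and (j, i) score the same."""
    n = len(a)
    return [[a[i][j] + a[j][i] for j in range(n)] for i in range(n)]

def apc(a):
    """Average product correction: subtract the row-sum * column-sum / total background."""
    n = len(a)
    row = [sum(r) for r in a]
    col = [sum(a[i][j] for i in range(n)) for j in range(n)]
    total = sum(row)
    if total == 0:
        return [r[:] for r in a]  # nothing to correct in an all-zero map
    return [[a[i][j] - row[i] * col[j] / total for j in range(n)] for i in range(n)]

def base_pair_probs(attn_maps, weights, bias):
    """Map H attention maps (each L x L) to one L x L matrix of pairing probabilities.

    weights (length H) and bias are assumed to come from a logistic regression
    fitted on RNAs with known base pairs; here they are illustrative inputs.
    """
    feats = [apc(symmetrize(m)) for m in attn_maps]
    n = len(attn_maps[0])
    probs = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            z = bias + sum(w * f[i][j] for w, f in zip(weights, feats))
            probs[i][j] = 1.0 / (1.0 + math.exp(-z))  # logistic link
    return probs

# Toy input: one attention head over a 3-nucleotide sequence.
toy_maps = [[[0.0, 0.1, 0.9],
             [0.2, 0.0, 0.1],
             [0.7, 0.3, 0.0]]]
toy_probs = base_pair_probs(toy_maps, weights=[2.0], bias=-1.0)
```

In this toy case, positions 0 and 2 attend to each other strongly, so the pair (0, 2) receives a higher probability than (0, 1). A real application would use all heads and layers of the model and would fit the weights on annotated structures.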

Citing Articles

Foundation models in bioinformatics.

Guo F, Guan R, Li Y, Liu Q, Wang X, Yang C Natl Sci Rev. 2025; 12(4):nwaf028.

PMID: 40078374 PMC: 11900445. DOI: 10.1093/nsr/nwaf028.

RNA sequence analysis landscape: A comprehensive review of task types, databases, datasets, word embedding methods, and language models.

Asim M, Ibrahim M, Asif T, Dengel A Heliyon. 2025; 11(2):e41488.

PMID: 39897847 PMC: 11783440. DOI: 10.1016/j.heliyon.2024.e41488.

RNAbpFlow: Base pair-augmented SE(3)-flow matching for conditional RNA 3D structure generation.

Tarafder S, Bhattacharya D bioRxiv. 2025.

PMID: 39896539 PMC: 11785242. DOI: 10.1101/2025.01.24.634669.

Overview and Prospects of DNA Sequence Visualization.

Wu Y, Xie X, Zhu J, Guan L, Li M Int J Mol Sci. 2025; 26(2).

PMID: 39859192 PMC: 11764684. DOI: 10.3390/ijms26020477.

Robust RNA secondary structure prediction with a mixture of deep learning and physics-based experts.

Qiu X Biol Methods Protoc. 2025; 10(1):bpae097.

PMID: 39811444 PMC: 11729747. DOI: 10.1093/biomethods/bpae097.
