GenerRNA: A Generative Pre-trained Language Model for De Novo RNA Design

Overview
Journal PLoS One
Date 2024 Oct 1
PMID 39352899
Abstract

The design of RNA plays a crucial role in developing RNA vaccines, nucleic acid therapeutics, and innovative biotechnological tools. However, existing techniques often lack versatility across tasks and depend on pre-defined secondary structures or other prior knowledge. To address these limitations, we introduce GenerRNA, a Transformer-based model inspired by the success of large language models (LLMs) in protein and molecule generation. GenerRNA is pre-trained on large-scale RNA sequence data and can generate novel RNA sequences with stable secondary structures that remain distinct from existing sequences, thereby expanding the explorable RNA space. Moreover, GenerRNA can be fine-tuned on smaller, specialized datasets for specific subtasks, enabling the generation of RNAs with desired functions or properties without requiring prior knowledge as input. As a demonstration, we fine-tuned GenerRNA and generated novel RNA sequences exhibiting high affinity for target proteins. Our work is the first application of a generative language model to RNA generation, presenting an innovative approach to RNA design.

Citing Articles

Foundation models in bioinformatics.

Guo F, Guan R, Li Y, Liu Q, Wang X, Yang C. Natl Sci Rev. 2025; 12(4):nwaf028.

PMID: 40078374 PMC: 11900445. DOI: 10.1093/nsr/nwaf028.


RNA sequence analysis landscape: A comprehensive review of task types, databases, datasets, word embedding methods, and language models.

Asim M, Ibrahim M, Asif T, Dengel A. Heliyon. 2025; 11(2):e41488.

PMID: 39897847 PMC: 11783440. DOI: 10.1016/j.heliyon.2024.e41488.


RNA language models predict mutations that improve RNA function.

Shulgina Y, Trinidad M, Langeberg C, Nisonoff H, Chithrananda S, Skopintsev P. Nat Commun. 2024; 15(1):10627.

PMID: 39638800 PMC: 11621547. DOI: 10.1038/s41467-024-54812-y.
