
RNA3DB: A Structurally-dissimilar Dataset Split for Training and Benchmarking Deep Learning Models for RNA Structure Prediction

Overview
Journal J Mol Biol
Publisher Elsevier
Date 2024 Mar 29
PMID 38552946
Abstract

With advances in protein structure prediction driven by deep learning models such as AlphaFold, RNA structure prediction has recently received increased attention from deep learning researchers. RNAs introduce substantial challenges because experimentally resolved RNA structures are sparser and less structurally diverse than protein structures. These challenges are often poorly addressed in the existing literature, much of which reports inflated performance because the training and testing sets share significant structural overlap. Further, the most recent Critical Assessment of Structure Prediction (CASP15) has shown that deep learning models for RNA structure are currently outperformed by traditional methods. In this paper we present RNA3DB, a dataset of structured RNAs, derived from the Protein Data Bank (PDB), that is designed for training and benchmarking deep learning models. The RNA3DB method arranges the RNA 3D chains into distinct groups (Components) that are non-redundant with regard to both sequence and structure, providing a robust way of dividing training, validation, and testing sets. Any split of these structurally-dissimilar Components is guaranteed to produce test and validation sets that are distinct by sequence and structure from the training set. We provide the RNA3DB dataset, a particular train/test split of the RNA3DB Components (in an approximate 70/30 ratio) that will be updated periodically. We also provide the RNA3DB methodology along with the source code, with the goal of creating a reproducible and customizable tool for producing structurally-dissimilar dataset splits for structural RNAs.
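The idea of splitting by Components can be illustrated with a minimal sketch. This is not the RNA3DB implementation: the chain names, the `similar_pairs` list, and both function names are hypothetical, and the sketch assumes that some upstream step has already decided which pairs of chains count as similar by sequence or structure. Chains linked by any chain of similarity are merged into one group with union-find, and whole groups are then assigned to the training or test set, so no similar pair can straddle the split.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Group nodes into connected components using union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        # Walk to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(list)
    for n in nodes:
        groups[find(n)].append(n)
    # Largest components first; ties broken by member names for determinism.
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g))


def split_components(components, train_fraction=0.7):
    """Assign whole components to train or test, never splitting one."""
    total = sum(len(c) for c in components)
    train, test = [], []
    for comp in components:
        # Fill the training set up to the target size; the rest goes to test.
        if not train or len(train) + len(comp) <= train_fraction * total:
            train.extend(comp)
        else:
            test.extend(comp)
    return train, test


# Hypothetical chains and similarity pairs, for illustration only.
chains = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"]
similar_pairs = [("A", "B"), ("B", "C"), ("D", "E"), ("F", "G")]

components = connected_components(chains, similar_pairs)
train, test = split_components(components, train_fraction=0.7)
print("components:", components)
print("train:", train)
print("test:", test)
```

Because components are assigned as indivisible units, the realized ratio is only approximately the requested one, which is consistent with the "approximate 70/30" wording of the abstract.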

Citing Articles

RNAbpFlow: Base pair-augmented SE(3)-flow matching for conditional RNA 3D structure generation.

Tarafder S, Bhattacharya D bioRxiv. 2025.

PMID: 39896539 PMC: 11785242. DOI: 10.1101/2025.01.24.634669.


Has AlphaFold3 achieved success for RNA?

Bernard C, Postic G, Ghannay S, Tahi F Acta Crystallogr D Struct Biol. 2025; 81(Pt 2):49-62.

PMID: 39868559 PMC: 11804252. DOI: 10.1107/S2059798325000592.


Robust RNA secondary structure prediction with a mixture of deep learning and physics-based experts.

Qiu X Biol Methods Protoc. 2025; 10(1):bpae097.

PMID: 39811444 PMC: 11729747. DOI: 10.1093/biomethods/bpae097.


RNA Structure: Past, Future, and Gene Therapy Applications.

Haseltine W, Hazel K, Patarca R Int J Mol Sci. 2025; 26(1.

PMID: 39795966 PMC: 11719923. DOI: 10.3390/ijms26010110.


Systematic benchmarking of deep-learning methods for tertiary RNA structure prediction.

Bahai A, Keong Kwoh C, Mu Y, Li Y PLoS Comput Biol. 2025; 20(12):e1012715.

PMID: 39775239 PMC: 11723642. DOI: 10.1371/journal.pcbi.1012715.

