RNA3DB: A Structurally-dissimilar Dataset Split for Training and Benchmarking Deep Learning Models for RNA Structure Prediction
Molecular Biology
With advances in protein structure prediction thanks to deep learning models like AlphaFold, RNA structure prediction has recently received increased attention from deep learning researchers. RNA introduces substantial challenges because experimentally resolved RNA structures are scarcer and less structurally diverse than protein structures. These challenges are often poorly addressed by the existing literature, much of which reports inflated performance because its training and testing sets overlap significantly in structure. Further, the most recent Critical Assessment of Structure Prediction (CASP15) has shown that deep learning models for RNA structure are currently outperformed by traditional methods. In this paper we present RNA3DB, a dataset of structured RNAs, derived from the Protein Data Bank (PDB), that is designed for training and benchmarking deep learning models. The RNA3DB method arranges the RNA 3D chains into distinct groups (Components) that are non-redundant with regard to both sequence and structure, providing a robust way of dividing training, validation, and testing sets. Any split of these structurally-dissimilar Components is guaranteed to produce test and validation sets that are distinct by sequence and structure from the training set. We provide the RNA3DB dataset, a particular train/test split of the RNA3DB Components (in an approximate 70/30 ratio) that will be updated periodically. We also provide the RNA3DB methodology along with the source code, with the goal of creating a reproducible and customizable tool for producing structurally-dissimilar dataset splits for structural RNAs.
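To make the idea of splitting by Component concrete, here is a minimal Python sketch. It is not the RNA3DB implementation. It assumes, purely for illustration, that Components can be modeled as connected components of a graph whose edges join chains judged similar by sequence or structure, and that whole Components are then assigned to the training set until it holds roughly 70% of the chains. The function names `group_into_components` and `split_components`, the chain labels, and the similarity pairs are all hypothetical.

```python
from collections import defaultdict


def group_into_components(chains, similar_pairs):
    """Group chains so that any two similar chains end up in the same group.

    Assumption for this sketch: a group is a connected component of the
    similarity graph, found here with a small union-find.
    """
    parent = {c: c for c in chains}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in similar_pairs:
        parent[find(a)] = find(b)

    groups = defaultdict(list)
    for c in chains:
        groups[find(c)].append(c)
    # Largest groups first, so the greedy split below places them early.
    return sorted(groups.values(), key=len, reverse=True)


def split_components(components, train_fraction=0.7):
    """Assign whole components to train or test; a component is never divided."""
    total = sum(len(comp) for comp in components)
    train, test = [], []
    for comp in components:
        # Fill the training set until it reaches its target share of chains.
        if len(train) < train_fraction * total:
            train.extend(comp)
        else:
            test.extend(comp)
    return train, test


# Toy input: ten hypothetical chains and six hypothetical similarity pairs.
chains = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
similar_pairs = [("a", "b"), ("b", "c"), ("c", "d"),
                 ("e", "f"), ("f", "g"), ("h", "i")]

components = group_into_components(chains, similar_pairs)
train, test = split_components(components)
```

Because every group is placed as a unit, no pair of similar chains can straddle the train/test boundary, which is the property the abstract describes. The greedy fill only approximates the target ratio, and real data would need the similarity relation to be computed first.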