RNANet: an Automatically Built Dual-source Dataset Integrating Homologous Sequences and RNA Structures

Overview
Journal Bioinformatics
Specialty Biology
Date 2020 Nov 2
PMID 33135044
Citations 7
Abstract

Motivation: Applied research in machine learning progresses faster when a clean, ready-to-use dataset is available. Several datasets have been proposed and released over the years for specific tasks such as image classification, speech recognition and, more recently, protein structure prediction. However, for the fundamental problem of RNA structure prediction, information is spread across several databases depending on the level of interest: sequence, secondary structure, 3D structure or interactions with other macromolecules. To speed up advances in machine-learning-based approaches for RNA secondary and/or 3D structure prediction, a dataset integrating all this information is required, so that time is not spent on data gathering and cleaning.

Results: Here, we propose the first attempt at a standardized and automatically generated dataset dedicated to RNA, combining RNA sequences, homology information (in the form of position-specific scoring matrices) and information derived by annotation of available 3D structures (including secondary structure, canonical and non-canonical interactions and backbone torsion angles). The data are retrieved from the public databases PDB, Rfam and SILVA. The paper describes the procedure used to build such a dataset and the RNA structure descriptors we provide. Some statistical descriptions of the resulting dataset are also provided.
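
To make the notion of homology information "in the form of position-specific scoring matrices" concrete, the following minimal sketch computes per-column nucleotide frequencies from a small multiple sequence alignment. It is an illustration only: the function name, the toy alignment and the choice of plain frequencies (rather than log-odds scores) are assumptions made here, and the abstract does not specify how RNANet computes or stores its matrices.

```python
from collections import Counter

ALPHABET = "ACGU"


def position_frequency_matrix(alignment, pseudocount=0.0):
    """Return one dict of nucleotide frequencies per alignment column.

    Hypothetical helper for illustration. Gaps and any symbol outside
    ACGU are ignored when counting.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    matrix = []
    for col in range(length):
        # Count every symbol seen in this column across all sequences.
        counts = Counter(seq[col].upper() for seq in alignment)
        total = sum(counts[n] for n in ALPHABET) + pseudocount * len(ALPHABET)
        if total == 0:
            # Column contains only gaps or unknown symbols.
            matrix.append({n: 0.0 for n in ALPHABET})
        else:
            matrix.append(
                {n: (counts[n] + pseudocount) / total for n in ALPHABET}
            )
    return matrix


# Toy alignment invented for this example (not taken from RNANet).
toy_alignment = ["GGCA-U", "GGCAAU", "GACA-U", "GGUAAU"]
pfm = position_frequency_matrix(toy_alignment)
```

Each row of `pfm` describes one alignment column, so a model can take the matrix as a fixed-width feature vector per position regardless of how many homologous sequences were aligned.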

Availability And Implementation: The dataset is updated every month and available online (in flat-text file format) on the EvryRNA software platform (https://evryrna.ibisc.univ-evry.fr/evryrna/rnanet). An efficient parallel pipeline to build the dataset is also provided for easy reproduction or modification.

Supplementary Information: Supplementary data are available at Bioinformatics online.

Citing Articles

Has AlphaFold3 achieved success for RNA?

Bernard C, Postic G, Ghannay S, Tahi F. Acta Crystallogr D Struct Biol. 2025; 81(Pt 2):49-62.

PMID: 39868559 PMC: 11804252. DOI: 10.1107/S2059798325000592.


sincFold: end-to-end learning of short- and long-range interactions in RNA secondary structure.

Bugnon L, Di Persia L, Gerard M, Raad J, Prochetto S, Fenoy E. Brief Bioinform. 2024; 25(4).

PMID: 38855913 PMC: 11163250. DOI: 10.1093/bib/bbae271.


RNA3DB: A structurally-dissimilar dataset split for training and benchmarking deep learning models for RNA structure prediction.

Szikszai M, Magnus M, Sanghi S, Kadyan S, Bouatta N, Rivas E. J Mol Biol. 2024; 436(17):168552.

PMID: 38552946 PMC: 11377173. DOI: 10.1016/j.jmb.2024.168552.


Shining a spotlight on m6A and the vital role of RNA modification in endometrial cancer: a review.

Jin Z, Sheng J, Hu Y, Zhang Y, Wang X, Huang Y. Front Genet. 2023; 14:1247309.

PMID: 37886684 PMC: 10598767. DOI: 10.3389/fgene.2023.1247309.


cgRNASP: coarse-grained statistical potentials with residue separation for RNA structure evaluation.

Tan Y, Wang X, Yu S, Zhang B, Tan Z. NAR Genom Bioinform. 2023; 5(1):lqad016.

PMID: 36879898 PMC: 9985339. DOI: 10.1093/nargab/lqad016.

