
Evaluation of Machine Learning Models That Predict LncRNA Subcellular Localization

Overview
Specialty Biology
Date 2024 Sep 19
PMID 39296930
Abstract

The lncATLAS database quantifies the relative cytoplasmic versus nuclear abundance of long non-coding RNAs (lncRNAs) observed in 15 human cell lines. The literature describes several machine learning models trained and evaluated on these and similar datasets. These reports showed moderate performance, e.g. 72-74% accuracy, on test subsets of the data withheld from training. In all these reports, the datasets were filtered to include genes with extreme values while excluding genes with values in the middle range, and the filters were applied before the data were partitioned into training and testing subsets. Using several models and lncATLAS data, we show that this 'middle exclusion' protocol boosts performance metrics without boosting model performance on unfiltered test data. We show that various models achieve only about 60% accuracy when evaluated on unfiltered lncRNA data. We suggest that the problem of predicting lncRNA subcellular localization from nucleotide sequences is more challenging than currently perceived. We provide a basic model and evaluation procedure as a benchmark for future studies of this problem.
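The effect of the 'middle exclusion' protocol on reported accuracy can be illustrated with a small synthetic sketch. It is not the authors' benchmark and does not use lncATLAS data. Several things here are assumptions made for illustration only: the localization value is modelled as a signed score whose sign gives the class, the cutoffs of -1.0 and 1.0 are hypothetical, and a fixed noisy rule stands in for a trained model. The same predictor is scored on a test set drawn after filtering and on a test set drawn from unfiltered data.

```python
import random


def middle_exclusion(records, low, high):
    """Keep only records whose value is extreme (<= low or >= high)."""
    return [r for r in records if r[1] <= low or r[1] >= high]


def split(records, test_fraction, seed):
    """Shuffle a copy of the records and partition it into (train, test)."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(records, predict):
    """Fraction of records where the prediction matches the sign of the value."""
    hits = sum(predict(feature) == (value > 0.0) for feature, value in records)
    return hits / len(records)


def predict(feature):
    """Stand-in for a trained model (assumption): call positive features 'cytoplasmic'."""
    return feature > 0.0


# Synthetic genes: (feature, value). The feature is a noisy view of the value,
# so the predictor is weakly informative. All numbers here are invented.
rng = random.Random(0)
records = []
for _ in range(20000):
    value = rng.gauss(0.0, 1.0)
    feature = value + rng.gauss(0.0, 2.0)
    records.append((feature, value))

# Protocol A: filter first, then partition. The test set contains only extremes.
_, test_filtered = split(middle_exclusion(records, -1.0, 1.0), 0.2, seed=1)

# Protocol B: partition the unfiltered data. The test set keeps the middle range.
_, test_unfiltered = split(records, 0.2, seed=1)

acc_filtered = accuracy(test_filtered, predict)
acc_unfiltered = accuracy(test_unfiltered, predict)
print(f"filtered test accuracy:   {acc_filtered:.3f}")
print(f"unfiltered test accuracy: {acc_unfiltered:.3f}")
```

The predictor is identical in both cases, so any gap between the two printed numbers comes from the composition of the test set, not from a better model. In a real study the filter, if used at all, would be applied to the training partition only, leaving the test partition unfiltered.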
