
A Pitfall for Machine Learning Methods Aiming to Predict Across Cell Types

Overview
Journal: Genome Biol
Specialties: Biology, Genetics
Date: 2020 Nov 20
PMID: 33213499
Citations: 29
Abstract

Machine learning models that predict genomic activity are most useful when they make accurate predictions across cell types. Here, we show that when the training and test sets contain the same genomic loci, the resulting model may falsely appear to perform well by effectively memorizing the average activity associated with each locus across the training cell types. We demonstrate this phenomenon in the context of predicting gene expression and chromatin domain boundaries, and we suggest methods to diagnose and avoid the pitfall. We anticipate that, as more data becomes available, future projects will increasingly risk suffering from this issue.
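The pitfall described in the abstract can be illustrated with a minimal sketch on simulated data. Everything here is an assumption made for illustration, not the authors' code or data: the `activity` matrix, the noise scales, and the helper names `average_activity_baseline` and `pearson` are all hypothetical. The sketch builds a baseline that predicts each locus as its average activity across the training cell types, then scores it on a held-out cell type at the same loci.

```python
import numpy as np


def average_activity_baseline(train):
    """Predict each locus as its mean activity over the training cell types."""
    return train.mean(axis=0)


def pearson(a, b):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


# Simulated data (assumption): each locus has a mean activity shared by all
# cell types, plus a smaller cell-type-specific deviation.
rng = np.random.default_rng(0)
n_cell_types, n_loci = 6, 2000
locus_mean = rng.normal(0.0, 2.0, size=n_loci)
cell_specific = rng.normal(0.0, 0.5, size=(n_cell_types, n_loci))
activity = locus_mean + cell_specific  # shape: (cell types, loci)

# Hold out the last cell type; training and test share the same loci.
train, test = activity[:-1], activity[-1]
baseline = average_activity_baseline(train)

# The baseline scores well on the raw held-out profile...
baseline_r = pearson(baseline, test)

# ...but carries essentially no information about what is specific to the
# held-out cell type, i.e. its deviation from the training average.
residual_r = pearson(baseline, test - baseline)

print(f"baseline vs. held-out cell type:      r = {baseline_r:.3f}")
print(f"baseline vs. cell-type-specific part: r = {residual_r:.3f}")
```

In this toy setting the baseline knows nothing about the held-out cell type, yet it correlates strongly with that cell type's profile. A model evaluated the same way should therefore be compared against such a baseline, or scored on the deviation from the training average, before its cross-cell-type accuracy is taken at face value.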

Citing Articles

Iterative improvement of deep learning models using synthetic regulatory genomics.

Ribeiro-Dos-Santos A, Maurano M bioRxiv. 2025.

PMID: 39974895 PMC: 11838587. DOI: 10.1101/2025.02.04.636130.


Loss of MEF2C function by enhancer mutation leads to neuronal mitochondria dysfunction and motor deficits in mice.

Yousefian-Jazi A, Kim S, Chu J, Choi S, Nguyen P, Park U Mol Neurodegener. 2025; 20(1):16.

PMID: 39920775 PMC: 11806887. DOI: 10.1186/s13024-024-00792-y.


Generative modeling for RNA splicing predictions and design.

Wu D, Maus N, Jha A, Yang K, Wales-McGrath B, Jewell S bioRxiv. 2025.

PMID: 39896553 PMC: 11785043. DOI: 10.1101/2025.01.20.633986.


Best holdout assessment is sufficient for cancer transcriptomic model selection.

Crawford J, Chikina M, Greene C Patterns (N Y). 2025; 5(12):101115.

PMID: 39776849 PMC: 11701843. DOI: 10.1016/j.patter.2024.101115.


Predicting cell type-specific epigenomic profiles accounting for distal genetic effects.

Murphy A, Beardall W, Rei M, Phuycharoen M, Skene N Nat Commun. 2024; 15(1):9951.

PMID: 39550354 PMC: 11569248. DOI: 10.1038/s41467-024-54441-5.


References
1. Nair S, Kim D, Perricone J, Kundaje A. Integrating regulatory DNA sequence and gene expression to predict genome-wide chromatin accessibility across cellular contexts. Bioinformatics. 2019; 35(14):i108-i116. PMC: 6612838. DOI: 10.1093/bioinformatics/btz352.

2. Li Y, Shi W, Wasserman W. Genome-wide prediction of cis-regulatory regions using supervised deep learning methods. BMC Bioinformatics. 2018; 19(1):202. PMC: 5984344. DOI: 10.1186/s12859-018-2187-1.

3. Thibodeau A, Uyar A, Khetan S, Stitzel M, Ucar D. A neural network based model effectively predicts enhancers from clinical ATAC-seq samples. Sci Rep. 2018; 8(1):16048. PMC: 6207744. DOI: 10.1038/s41598-018-34420-9.

4. Erwin G, Oksenberg N, Truty R, Kostka D, Murphy K, Ahituv N. Integrating diverse datasets improves developmental enhancer prediction. PLoS Comput Biol. 2014; 10(6):e1003677. PMC: 4072507. DOI: 10.1371/journal.pcbi.1003677.

5. Singh R, Lanchantin J, Robins G, Qi Y. DeepChrome: deep-learning for predicting gene expression from histone modifications. Bioinformatics. 2016; 32(17):i639-i648. DOI: 10.1093/bioinformatics/btw427.