
Best Holdout Assessment is Sufficient for Cancer Transcriptomic Model Selection

Overview
Journal Patterns (N Y)
Date 2025 Jan 8
PMID 39776849
Abstract

Guidelines in statistical modeling for genomics hold that simpler models have advantages over more complex ones. Potential advantages include cost, interpretability, and improved generalization across datasets or biological contexts. We directly tested the assumption that small gene signatures generalize better by examining the generalization of mutation status prediction models across datasets (from cell lines to human tumors and vice versa) and biological contexts (holding out entire cancer types from pan-cancer data). We compared two model selection strategies: choosing models solely by cross-validation performance, and choosing them by combining cross-validation performance with regularization strength. We did not observe that more regularized signatures generalized better. This result held across both generalization problems and for both linear models (LASSO logistic regression) and non-linear ones (neural networks). When the goal of an analysis is to produce generalizable predictive models, we recommend choosing the ones that perform best on held-out data or in cross-validation instead of those that are smaller or more regularized.
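
The two selection strategies compared in the abstract can be illustrated with a minimal sketch. The abstract does not say how cross-validation performance and regularization strength were combined, so the "one standard error" rule used here is an assumption standing in for that combined strategy, not the authors' stated method. The scores, the regularization strengths, and the function names are hypothetical and exist only to show how the two rules can disagree.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical per-fold cross-validation scores for each regularization
# strength. Larger strength = more regularized = smaller gene signature.
# These numbers are illustrative only, not results from the paper.
cv_scores = {
    0.001: [0.81, 0.79, 0.83, 0.80],
    0.01: [0.84, 0.82, 0.85, 0.83],
    0.1: [0.83, 0.82, 0.85, 0.82],
    1.0: [0.70, 0.72, 0.69, 0.71],
}


def select_best(scores):
    """Strategy 1: pick the strength with the highest mean CV score."""
    return max(scores, key=lambda strength: mean(scores[strength]))


def select_most_regularized_within_one_se(scores):
    """Strategy 2 (assumed rule): pick the strongest regularization whose
    mean CV score is within one standard error of the best mean score."""
    best_folds = scores[select_best(scores)]
    threshold = mean(best_folds) - stdev(best_folds) / sqrt(len(best_folds))
    eligible = [s for s in scores if mean(scores[s]) >= threshold]
    return max(eligible)


best = select_best(cv_scores)
smallest_good = select_most_regularized_within_one_se(cv_scores)
print(f"best CV performance: strength={best}")
print(f"most regularized within one SE: strength={smallest_good}")
```

In this toy example the two rules pick different models. The abstract's recommendation corresponds to the first rule: choose the model that performs best in cross-validation or on held-out data, without favoring the more regularized one.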

Citing Articles

Model interpretability enhances domain generalization in the case of textual complexity modeling.

van der Sluis F, van den Broek E. Patterns (N Y). 2025; 6(2):101177.

PMID: 40041855 PMC: 11873011. DOI: 10.1016/j.patter.2025.101177.
