
The Effect of Non-linear Signal in Classification Problems Using Gene Expression

Overview
Specialty: Biology
Date: 2023 Mar 27
PMID: 36972227
Abstract

Those building predictive models from transcriptomic data face two conflicting perspectives. The first, based on the inherent high dimensionality of biological systems, supposes that complex non-linear models such as neural networks will better match complex biological systems. The second, imagining that complex systems will still be well predicted by simple dividing lines, prefers linear models, which are easier to interpret. We compare multi-layer neural networks and logistic regression across multiple prediction tasks on the GTEx and Recount3 datasets and find evidence in favor of both possibilities. We verified the presence of non-linear signal when predicting tissue and metadata sex labels from expression data by removing the predictive linear signal with Limma, and showed that the removal ablated the performance of linear methods but not non-linear ones. However, we also found that the presence of non-linear signal was not necessarily sufficient for neural networks to outperform logistic regression. Our results demonstrate that while multi-layer neural networks may be useful for making predictions from gene expression data, including a linear baseline model is critical: biological systems are high-dimensional, but effective dividing lines for predictive models may not be.
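The idea of removing linear signal while leaving non-linear signal intact can be illustrated with a small toy sketch. It is not the authors' pipeline: the two "genes", their values, and the helper names (`class_means`, `remove_linear_signal`, `logistic_gradient_at_zero`) are hypothetical, and a per-gene least-squares removal of the label term stands in, by assumption, for the Limma step. The sketch relies on one fact about logistic regression: with balanced classes, the loss gradient at zero weights is proportional to the difference in class means, and because the loss is convex, a zero gradient there means the model cannot improve on predicting 0.5 for every sample using that feature.

```python
# Toy illustration only: synthetic numbers, not GTEx or Recount3 data.


def class_means(values, labels):
    """Return (mean of class 0, mean of class 1) for one gene."""
    zeros = [v for v, y in zip(values, labels) if y == 0]
    ones = [v for v, y in zip(values, labels) if y == 1]
    return sum(zeros) / len(zeros), sum(ones) / len(ones)


def remove_linear_signal(values, labels):
    """Subtract the label-associated mean shift from one gene.

    This matches fitting gene ~ intercept + label by least squares and
    removing the label term (an assumed stand-in for the Limma step).
    """
    mean0, mean1 = class_means(values, labels)
    slope = mean1 - mean0
    return [v - slope * y for v, y in zip(values, labels)]


def logistic_gradient_at_zero(values, labels):
    """Gradient of the mean logistic loss w.r.t. one weight at weight=0, bias=0.

    At zero weights every predicted probability is 0.5, so the gradient
    reduces to mean((0.5 - y) * x).
    """
    return sum((0.5 - y) * v for v, y in zip(values, labels)) / len(values)


# Eight samples per class; values chosen to be exact in binary floating point.
base = [-0.875, -0.625, -0.375, -0.125, 0.125, 0.375, 0.625, 0.875]
labels = [0] * 8 + [1] * 8

# gene_a: class 1 is shifted upward (linear signal).
gene_a = base + [b + 3.0 for b in base]
# gene_b: same class means, but class 1 is far more spread out (non-linear signal).
gene_b = base + [-3.5, -3.0, -2.5, -2.0, 2.0, 2.5, 3.0, 3.5]

gene_a_resid = remove_linear_signal(gene_a, labels)
gene_b_resid = remove_linear_signal(gene_b, labels)

grad_a_before = logistic_gradient_at_zero(gene_a, labels)
grad_a_after = logistic_gradient_at_zero(gene_a_resid, labels)
grad_b_after = logistic_gradient_at_zero(gene_b_resid, labels)

# A non-linear transform (squaring) exposes the spread difference in gene_b.
gene_b_squared = [v * v for v in gene_b_resid]
grad_b_squared = logistic_gradient_at_zero(gene_b_squared, labels)

print("gene_a gradient before removal:", grad_a_before)
print("gene_a gradient after removal: ", grad_a_after)
print("gene_b gradient after removal: ", grad_b_after)
print("gene_b squared, gradient:      ", grad_b_squared)
```

In this toy setting, the gradient for the raw features vanishes after removal, so a logistic regression on them has nothing to learn, while the squared feature still has a non-zero gradient. A model that can build such transforms internally, as a multi-layer network can, may therefore still find signal that a linear model cannot.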

Citing Articles

Best holdout assessment is sufficient for cancer transcriptomic model selection.

Crawford J, Chikina M, Greene C. Patterns (N Y). 2025; 5(12):101115.

PMID: 39776849 PMC: 11701843. DOI: 10.1016/j.patter.2024.101115.


MousiPLIER: A Mouse Pathway-Level Information Extractor Model.

Zhang S, Heil B, Mao W, Chikina M, Greene C, Heller E. eNeuro. 2024; 11(6).

PMID: 38789274 PMC: 11154669. DOI: 10.1523/ENEURO.0313-23.2024.
