Breast Cancer Prediction Based on Gene Expression Data Using Interpretable Machine Learning Techniques

Overview
Journal: Sci Rep
Specialty: Science
Date: 2025 Mar 4
PMID: 40038307
Abstract

Breast cancer remains a global health burden, and deaths from the disease are increasing. Accurate prediction and diagnosis of breast cancer are important for treatment development and patient survival. This study aimed to predict breast cancer accurately using a dataset comprising 1208 observations and 3602 genes. Feature selection techniques were applied to identify the genes most influential for predicting breast cancer with machine learning (ML) models. Three models were used: k-nearest neighbors (KNN), random forests (RF), and a support vector machine (SVM). The study also employed feature- and model-based importance measures and explainable ML methods, including Shapley values, partial dependence plots (PDPs), and accumulated local effects (ALE) plots, to explain the gene importance rankings produced by the ML methods. Shapley values highlighted a subset of genes as significant for predicting cancer presence. Model-based feature ranking techniques, particularly the Leaving-One-Covariate-In (LOCI) method, identified the ten most critical genes for predicting tumor cases, and the LOCI rankings from the SVM and RF models were aligned. Visualization methods such as PDPs and ALE plots showed how changes in an individual feature affect predictions and how that feature interacts with other genes. By combining feature selection techniques with explainable ML methods, this study demonstrates the interpretability and reliability of ML models for breast cancer prediction and emphasizes the importance of incorporating explainable ML approaches into medical decision-making.
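
The abstract does not define LOCI, so the following is a minimal sketch of one common reading of the idea: refit the model on each feature alone and score that feature by how far its cross-validated accuracy exceeds a featureless (majority-class) baseline. The function name loci_scores, the use of scikit-learn, the accuracy metric, and the synthetic data are all assumptions made for illustration; this is not the authors' pipeline or data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def loci_scores(model, X, y, cv=5):
    """Hypothetical helper: leave-one-covariate-in importance.

    Each feature is scored by the mean cross-validated accuracy of a
    model trained on that feature alone, minus the accuracy of a
    majority-class baseline that ignores all features.
    """
    baseline = cross_val_score(
        DummyClassifier(strategy="most_frequent"), X, y, cv=cv
    ).mean()
    scores = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        # Keep the single column 2-D, as scikit-learn estimators expect.
        acc = cross_val_score(model, X[:, [j]], y, cv=cv).mean()
        scores[j] = acc - baseline
    return scores


# Synthetic stand-in for an expression matrix (NOT the study's data):
# 300 samples, 20 "genes", of which the first 4 carry signal.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=4, n_redundant=0,
    shuffle=False, random_state=0,
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
scores = loci_scores(model, X, y)

# Indices of the ten highest-scoring features, best first.
top10 = np.argsort(scores)[::-1][:10]
print(top10)
```

Running the same function with a different estimator, for example sklearn.svm.SVC, and comparing the two top-ten lists would mirror the kind of SVM versus RF ranking comparison the abstract mentions.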
