Processing Imbalanced Medical Data at the Data Level with Assisted-reproduction Data As an Example

Overview

Journal BioData Min

Publisher Biomed Central

Specialty Biology

Date 2024 Sep 5

PMID 39232851

Authors

Junliang Zhu

Shaowei Pu

Jiaji He

Dongchao Su

Weijie Cai

Xueying Xu

Hongbo Liu

Abstract

Objective: Data imbalance is a pervasive issue in medical data mining, often leading to biased and unreliable predictive models. This study aims to address the urgent need for effective strategies to mitigate the impact of data imbalance on classification models. We focus on quantifying the effects of different imbalance degrees and sample sizes on model performance, identifying optimal cut-off values, and evaluating the efficacy of various methods to enhance model accuracy in highly imbalanced and small sample size scenarios.

Methods: We collected medical records of patients receiving assisted reproductive treatment in a reproductive medicine center. Random forest was used to screen the key variables for the prediction target. Datasets with different imbalance degrees and sample sizes were constructed to compare the classification performance of logistic regression models. AUC, G-mean, F1-Score, Accuracy, Recall, and Precision were used for evaluation. Four imbalance treatment methods, two oversampling (SMOTE and ADASYN) and two undersampling (one-sided selection, OSS, and condensed nearest neighbor, CNN), were applied to datasets with low positive rates and small sample sizes to assess their effectiveness.
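To make the threshold-based metrics named above concrete, the following is a minimal sketch of how Accuracy, Recall, Precision, F1-Score, and G-mean follow from a confusion matrix. The function name `imbalance_metrics` and the toy labels are illustrative assumptions, not taken from the study. AUC is omitted because it requires predicted scores rather than hard labels.

```python
import math


def imbalance_metrics(y_true, y_pred):
    """Confusion-matrix metrics for binary labels (1 = positive, 0 = negative).

    Hypothetical helper for illustration; not the study's code.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Return 0.0 instead of dividing by zero when a class is never predicted.
        return num / den if den else 0.0

    recall = ratio(tp, tp + fn)        # sensitivity
    specificity = ratio(tn, tn + fp)
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "recall": recall,
        "precision": precision,
        "f1": ratio(2 * precision * recall, precision + recall),
        "g_mean": math.sqrt(recall * specificity),
    }


# Toy data (made up): 10 records with a 20% positive rate.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# A model that always predicts the majority class.
all_negative = imbalance_metrics(y_true, [0] * 10)

# A model that finds one of the two positives at the cost of one false alarm.
mixed = imbalance_metrics(y_true, [1, 0, 0, 0, 0, 0, 0, 0, 0, 1])
```

In this toy case the always-negative model reaches high accuracy while its recall, F1, and G-mean are all zero, which illustrates why accuracy alone is a poor guide on imbalanced data.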

Results: The logistic model's performance was low when the positive rate was below 10% but stabilized beyond this threshold. Similarly, sample sizes below 1200 yielded poor results, with improvement seen above this threshold. For robustness, the optimal cut-offs for positive rate and sample size were identified as 15% and 1500, respectively. SMOTE and ADASYN oversampling significantly improved classification performance in datasets with low positive rates and small sample sizes.
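The interpolation idea behind SMOTE-style oversampling can be sketched in a few lines. This is a simplified illustration under stated assumptions (Euclidean distance, numeric features only, a hypothetical function name `smote_sketch`), not the implementation used in the study. In practice a maintained library such as imbalanced-learn, which provides `SMOTE` and `ADASYN` samplers, would normally be used instead.

```python
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority points.

    Each synthetic point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    Illustrative only; names and defaults are assumptions.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by squared Euclidean distance.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(
            key=lambda j: sum((a - b) ** 2 for a, b in zip(base, minority[j]))
        )
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy minority class (made up): four points on the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(minority, n_new=6, k=2)
```

Because every synthetic point is a convex combination of two real minority samples, it stays inside the region the minority class already occupies. ADASYN follows a similar interpolation scheme but generates more points near minority samples that are harder to classify.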

Conclusions: The study identifies a positive rate of 15% and a sample size of 1500 as optimal cut-offs for stable logistic model performance. For datasets with low positive rates and small sample sizes, SMOTE and ADASYN are recommended to improve balance and model accuracy.
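As a rough illustration of how the reported cut-offs might be applied when screening a new dataset, the sketch below flags data that fall under a 15% positive rate or 1500 samples. The function name `needs_rebalancing` is hypothetical. The thresholds were reported for logistic regression on one assisted-reproduction dataset, so treating them as defaults elsewhere is an assumption.

```python
def needs_rebalancing(labels, min_positive_rate=0.15, min_sample_size=1500):
    """Flag a binary-labelled dataset (1 = positive) that falls below the cut-offs.

    Defaults mirror the cut-offs stated in the abstract; whether they
    transfer to other data or models is an assumption.
    """
    n = len(labels)
    positive_rate = sum(labels) / n if n else 0.0
    low_rate = positive_rate < min_positive_rate
    small = n < min_sample_size
    return {
        "n": n,
        "positive_rate": positive_rate,
        "low_positive_rate": low_rate,
        "small_sample": small,
        # Oversampling is suggested here only when at least one flag is raised.
        "consider_oversampling": low_rate or small,
    }


# Toy label vectors (made up).
sparse = needs_rebalancing([1] * 100 + [0] * 900)      # 1000 records, 10% positive
adequate = needs_rebalancing([1] * 400 + [0] * 1600)   # 2000 records, 20% positive
```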
