Predicting Disease Risks from Highly Imbalanced Data Using Random Forest

Overview

Journal BMC Med Inform Decis Mak

Publisher BioMed Central

Specialty Medical Informatics

Date 2011 Aug 2

PMID 21801360

Citations 133

Authors

Mohammed Khalilia

Sounak Chakraborty

Mihail Popescu

Abstract

Background: We present a method that uses the Healthcare Cost and Utilization Project (HCUP) dataset to predict an individual's disease risk from their medical diagnosis history. The methodology may be incorporated into a variety of applications, such as risk management, tailored health communication and decision support systems in healthcare.

Methods: We used the National Inpatient Sample (NIS) data, which is publicly available through the Healthcare Cost and Utilization Project (HCUP), to train random forest (RF) classifiers for disease prediction. Because the HCUP data is highly imbalanced, we employed an ensemble learning approach based on repeated random sub-sampling. This technique divides the training data into multiple sub-samples while ensuring that each sub-sample is fully balanced. We compared the performance of support vector machine (SVM), bagging, boosting and RF classifiers in predicting the risk of eight chronic diseases.
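The balanced sub-sampling step described in the Methods can be illustrated with a minimal Python sketch. The abstract does not give the exact procedure, so the details here are assumptions: the function name `balanced_subsamples` is hypothetical, every sub-sample reuses all minority-class records, and an equal number of majority-class records is drawn at random without replacement for each sub-sample.

```python
import random


def balanced_subsamples(labels, n_subsamples, seed=0):
    """Return index lists that each hold equal numbers of both classes.

    Assumption: labels are 0/1, and each sub-sample keeps every
    minority-class index plus an equally sized random draw
    (without replacement) from the majority class.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Whichever class is smaller is treated as the minority.
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)

    subsamples = []
    for _ in range(n_subsamples):
        drawn = rng.sample(majority, len(minority))
        subsamples.append(sorted(minority + drawn))
    return subsamples


# Toy data: 5 positive records among 100 (5% prevalence).
labels = [1] * 5 + [0] * 95
subsamples = balanced_subsamples(labels, n_subsamples=10)
```

In an ensemble built this way, one classifier would be trained on each index list and the individual predictions combined, for example by voting or by averaging predicted probabilities. The combination rule is likewise an assumption, since the abstract does not state it.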

Results: We predicted eight disease categories. Overall, the RF ensemble learning method outperformed SVM, bagging and boosting in terms of the area under the receiver operating characteristic (ROC) curve (AUC). In addition, RF has the advantage of computing the importance of each variable in the classification process.
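For readers unfamiliar with the evaluation metric, the AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it directly from that definition. The function name `roc_auc` and the toy labels and scores are illustrative assumptions, not data from the study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Counts the fraction of (positive, negative) pairs in which the
    positive case has the higher score; ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical predicted risks for six patients (1 = has the disease).
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
auc = roc_auc(labels, scores)
```

Because the measure depends only on how cases are ranked, it is not inflated by always predicting the majority class, which makes it a natural choice for highly imbalanced data such as this.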

Conclusions: By combining repeated random sub-sampling with RF, we were able to overcome the class imbalance problem and achieve promising results. Using the national HCUP dataset, we predicted eight disease categories with an average AUC of 88.79%.

Citing Articles

Development of a Machine Learning-Powered Optimized Lung Allocation System for Maximum Benefits in Lung Transplantation: A Korean National Data.

Ha M, Cho W, So M, Lee D, Kim Y, Yeo H. J Korean Med Sci. 2025; 40(7):e18.

PMID: 39995255 PMC: 11858608. DOI: 10.3346/jkms.2025.40.e18.

Derivation and validation of a clinical predictive model for longer duration diarrhea among pediatric patients in Kenya using machine learning algorithms.

Ogwel B, Mzazi V, Awuor A, Okonji C, Anyango R, Oreso C. BMC Med Inform Decis Mak. 2025; 25(1):28.

PMID: 39815316 PMC: 11737202. DOI: 10.1186/s12911-025-02855-6.

Predicting Parkinson's Disease Using a Deep-Learning Algorithm to Analyze Prodromal Medical and Prescription Data.

Koo Y, Kim M, Lee W. J Clin Neurol. 2025; 21(1):21-30.

PMID: 39778564 PMC: 11711266. DOI: 10.3988/jcn.2024.0175.

Low-carbohydrate diet score and chronic obstructive pulmonary disease: a machine learning analysis of NHANES data.

Zhang X, Mo J, Yang K, Tan T, Zhao C, Qin H. Front Nutr. 2025; 11:1519782.

PMID: 39777077 PMC: 11706202. DOI: 10.3389/fnut.2024.1519782.

Application of machine learning for mass spectrometry-based multi-omics in thyroid diseases.

Che Y, Zhao M, Gao Y, Zhang Z, Zhang X. Front Mol Biosci. 2025; 11:1483326.

PMID: 39741929 PMC: 11685090. DOI: 10.3389/fmolb.2024.1483326.
