A Simulation Study on Missing Data Imputation for Dichotomous Variables Using Statistical and Machine Learning Methods

Overview
Journal: Sci Rep
Specialty: Science
Date: 2023 Jun 9
PMID: 37296269
Abstract

Missing data, particularly in dichotomous variables, is a common problem in medical research. However, few studies have examined imputation methods for dichotomous data, their performance and applicability, or the factors that may affect that performance. The application scenarios covered different missing mechanisms, sample sizes, missing rates, correlations between variables, value distributions, and numbers of missing variables. We used data simulation to construct a variety of compound scenarios for missing dichotomous variables and validated the findings on two real-world medical datasets. We compared the performance of eight imputation methods (mode, logistic regression (LogReg), multiple imputation (MI), decision tree (DT), random forest (RF), k-nearest neighbor (KNN), support vector machine (SVM), and artificial neural network (ANN)) in each scenario, using accuracy and mean absolute error (MAE) as the evaluation metrics. The results showed that the missing mechanism, the value distribution, and the correlation between variables were the main factors affecting the performance of imputation methods. Machine learning-based methods, especially SVM, ANN, and DT, achieved relatively high accuracy with stable performance and appear potentially applicable in practice. When encountering missing dichotomous data, researchers should explore the correlation between variables and their distribution pattern in advance and prioritize machine learning-based methods for practical applications.
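The evaluation loop described in the abstract (simulate a dichotomous variable, remove values, impute them, score with accuracy and MAE) can be illustrated with a minimal sketch. This is not the authors' code: the data-generating rule, the 80/20 probabilities, the 20% missing rate, the choice of a completely-at-random mechanism, and all function names are assumptions made for illustration. Only two of the eight methods are shown (mode and a simple one-predictor KNN), and the sketch uses the Python standard library only.

```python
import random

def simulate(n, seed=0):
    """Simulate (x, y) pairs: x is continuous, y is a 0/1 variable correlated with x."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        # Assumed rule: y is more likely to be 1 when x is positive.
        p = 0.8 if x > 0 else 0.2
        data.append((x, 1 if rng.random() < p else 0))
    return data

def mask_mcar(data, rate, seed=1):
    """Pick indices whose y is treated as missing completely at random."""
    rng = random.Random(seed)
    return [i for i in range(len(data)) if rng.random() < rate]

def impute_mode(data, missing):
    """Fill every missing y with the most frequent observed value (ties go to 1)."""
    miss = set(missing)
    observed = [y for i, (_, y) in enumerate(data) if i not in miss]
    mode = 1 if sum(observed) * 2 >= len(observed) else 0
    return {i: mode for i in missing}

def impute_knn(data, missing, k=5):
    """Fill each missing y by majority vote among the k observed rows closest in x."""
    miss = set(missing)
    observed = [(x, y) for i, (x, y) in enumerate(data) if i not in miss]
    imputed = {}
    for i in missing:
        x_i = data[i][0]
        nearest = sorted(observed, key=lambda row: abs(row[0] - x_i))[:k]
        votes = sum(y for _, y in nearest)
        imputed[i] = 1 if votes * 2 > k else 0
    return imputed

def accuracy(data, imputed):
    """Share of imputed values equal to the true (hidden) value."""
    return sum(1 for i, v in imputed.items() if data[i][1] == v) / len(imputed)

def mae(data, imputed):
    """Mean absolute difference between imputed and true 0/1 values."""
    return sum(abs(data[i][1] - v) for i, v in imputed.items()) / len(imputed)

data = simulate(500)
missing = mask_mcar(data, rate=0.2)
results = {}
for name, method in [("mode", impute_mode), ("knn", impute_knn)]:
    imputed = method(data, missing)
    results[name] = (accuracy(data, imputed), mae(data, imputed))
    print(name, results[name])
```

When a method imputes hard 0/1 labels, as both methods in this sketch do, each absolute error is either 0 or 1, so MAE equals one minus accuracy. The two metrics differ only when an imputed value is something other than a hard label, for example a predicted probability.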

Citing Articles

An interactive atlas of genomic, proteomic, and metabolomic biomarkers promotes the potential of proteins to predict complex diseases.

Smelik M, Zhao Y, Li X, Loscalzo J, Sysoev O, Mahmud F Sci Rep. 2024; 14(1):12710.

PMID: 38830935 PMC: 11148091. DOI: 10.1038/s41598-024-63399-9.


An interactive atlas of genomic, proteomic, and metabolomic biomarkers promotes the potential of proteins to predict complex diseases.

Benson M, Smelik M, Li X, Loscalzo J, Sysoev O, Mahmud F Res Sq. 2024; .

PMID: 38496611 PMC: 10942575. DOI: 10.21203/rs.3.rs-3921099/v1.
