A Simulation Study on Missing Data Imputation for Dichotomous Variables Using Statistical and Machine Learning Methods

Overview

Journal Sci Rep

Specialty Science

Date 2023 Jun 9

PMID 37296269

Authors

Yingfeng Ge

Zhiwei Li

Jinxin Zhang

Abstract

Missing data, particularly for dichotomous variables, are common in medical research. However, few studies have examined imputation methods for dichotomous data, their performance and applicability, or the factors that may affect that performance. The application scenarios varied the missing mechanism, sample size, missing rate, correlation between variables, value distribution, and number of variables with missing values. We used data simulation to construct a range of compound scenarios for missing dichotomous variables and validated the findings on two real-world medical datasets. In each scenario, we compared the performance of eight imputation methods: mode, logistic regression (LogReg), multiple imputation (MI), decision tree (DT), random forest (RF), k-nearest neighbor (KNN), support vector machine (SVM), and artificial neural network (ANN). Performance was evaluated using accuracy and mean absolute error (MAE). The results showed that the missing mechanism, value distribution, and correlation between variables were the main factors affecting the performance of the imputation methods. Machine learning-based methods, especially SVM, ANN, and DT, achieved relatively high accuracy with stable performance and showed potential for practical application. When encountering missing dichotomous data, researchers should explore the correlation between variables and their distribution patterns in advance and prioritize machine learning-based methods in practice.
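
The kind of evaluation the abstract describes can be illustrated with a minimal sketch. It is not the authors' code. It assumes a single pair of correlated dichotomous variables, a missing-completely-at-random mask, and two simple imputers: the overall mode, and a hypothetical "conditional mode" that stands in for the model-based methods by using a correlated predictor. All function names, probabilities, sample sizes, and seeds below are illustrative assumptions. For 0/1 values, each wrong imputation contributes an absolute error of 1, so MAE equals 1 minus accuracy.

```python
import random

def simulate(n, p_x=0.5, p_y_given_x=(0.2, 0.8), seed=0):
    """Draw n pairs (x, y) of correlated 0/1 variables (assumed parameters)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = 1 if rng.random() < p_x else 0
        y = 1 if rng.random() < p_y_given_x[x] else 0
        data.append((x, y))
    return data

def mask_mcar(n, rate, seed=1):
    """Pick row indices whose y is treated as missing, independently of the data."""
    rng = random.Random(seed)
    return {i for i in range(n) if rng.random() < rate}

def mode(values):
    """Most frequent value of a 0/1 list; ties resolve to 1."""
    return 1 if 2 * sum(values) >= len(values) else 0

def impute_mode(data, missing):
    """Fill every missing y with the mode of the observed y values."""
    observed = [y for i, (_, y) in enumerate(data) if i not in missing]
    fill = mode(observed)
    return {i: fill for i in missing}

def impute_conditional_mode(data, missing):
    """Fill each missing y with the mode of observed y among rows sharing its x."""
    fill = {}
    for level in (0, 1):
        observed = [y for i, (x, y) in enumerate(data)
                    if i not in missing and x == level]
        fill[level] = mode(observed) if observed else 0
    return {i: fill[data[i][0]] for i in missing}

def accuracy(data, imputed):
    """Share of imputed values that match the true (held-back) value."""
    return sum(1 for i, v in imputed.items() if v == data[i][1]) / len(imputed)

def mae(data, imputed):
    """Mean absolute error between imputed and true values."""
    return sum(abs(v - data[i][1]) for i, v in imputed.items()) / len(imputed)

# One illustrative scenario: n = 2000, 20% missing rate, MCAR.
data = simulate(2000)
missing = mask_mcar(len(data), rate=0.2)

results = {}
for name, imputer in (("mode", impute_mode),
                      ("conditional mode", impute_conditional_mode)):
    imputed = imputer(data, missing)
    results[name] = (accuracy(data, imputed), mae(data, imputed))
    print(f"{name}: accuracy={results[name][0]:.3f}, MAE={results[name][1]:.3f}")
```

Changing `p_y_given_x` alters the correlation between the variables, and changing `rate` or `n` alters the missing rate and sample size, which are among the scenario factors the abstract lists. Missing-at-random or missing-not-at-random masks would need a mask function that depends on `x` or on `y` itself.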

Citing Articles

An interactive atlas of genomic, proteomic, and metabolomic biomarkers promotes the potential of proteins to predict complex diseases.

Smelik M, Zhao Y, Li X, Loscalzo J, Sysoev O, Mahmud F. Sci Rep. 2024; 14(1):12710.

PMID: 38830935 PMC: 11148091. DOI: 10.1038/s41598-024-63399-9.

An interactive atlas of genomic, proteomic, and metabolomic biomarkers promotes the potential of proteins to predict complex diseases.

Benson M, Smelik M, Li X, Loscalzo J, Sysoev O, Mahmud F. Res Sq. 2024.

PMID: 38496611 PMC: 10942575. DOI: 10.21203/rs.3.rs-3921099/v1.
