Review: A Gentle Introduction to Imputation of Missing Values

Overview

Journal: J Clin Epidemiol

Publisher: Elsevier

Specialty: Public Health

Date: 2006 Sep 19

PMID: 16980149

Citations: 684

Authors

A Rogier T Donders

Geert J M G van der Heijden

Theo Stijnen

Karel G M Moons

Abstract

In most situations, simple techniques for handling missing data (such as complete case analysis, overall mean imputation, and the missing-indicator method) produce biased results, whereas imputation techniques yield valid results without complicating the analysis once the imputations are carried out. Imputation techniques are based on the idea that any subject in a study sample can be replaced by a new randomly chosen subject from the same source population. Imputing missing data on a variable means replacing each missing value with a value drawn from an estimate of the distribution of that variable. In single imputation, only one estimate is used. In multiple imputation, various estimates are used, reflecting the uncertainty in the estimation of this distribution. Under the general conditions of so-called missing at random and missing completely at random, both single and multiple imputation result in unbiased estimates of study associations. However, single imputation yields estimated standard errors that are too small, whereas multiple imputation yields correctly estimated standard errors and confidence intervals. In this article we explain why all this is the case, and use a simple simulation study to demonstrate our explanations. We also explain and illustrate why two frequently used methods to handle missing data, i.e., overall mean imputation and the missing-indicator method, almost always result in biased estimates.
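The contrast drawn in the abstract can be illustrated with a small simulation. The following is a minimal sketch and not the authors' simulation study: the data-generating model, the missingness mechanism (the outcome y is missing at random given a fully observed x), the bootstrap-based parameter draws, and all function names are assumptions chosen for illustration. It compares overall mean imputation, a single stochastic regression imputation, and multiple imputation pooled with Rubin's rules.

```python
import numpy as np

def ols_slope(x, y):
    """Return the OLS slope of y on x and its model-based standard error."""
    xc = x - x.mean()
    slope = (xc @ (y - y.mean())) / (xc @ xc)
    resid = (y - y.mean()) - slope * xc
    se = np.sqrt((resid @ resid) / (len(x) - 2) / (xc @ xc))
    return float(slope), float(se)

def impute_once(x, y_obs, miss, rng):
    """Stochastic regression imputation of y from x.

    The imputation model is refit on a bootstrap sample of the complete
    cases, so repeated calls reflect uncertainty in the estimated
    distribution (an assumed, simple way to obtain proper imputations).
    """
    obs = np.flatnonzero(~miss)
    boot = rng.choice(obs, size=obs.size, replace=True)
    xb, yb = x[boot], y_obs[boot]
    slope, _ = ols_slope(xb, yb)
    intercept = yb.mean() - slope * xb.mean()
    resid = yb - intercept - slope * xb
    sd = np.sqrt((resid @ resid) / (len(xb) - 2))
    y_imp = y_obs.copy()
    # Draw from the estimated conditional distribution, not its mean.
    y_imp[miss] = intercept + slope * x[miss] + rng.normal(0.0, sd, miss.sum())
    return y_imp

def simulate(n=2000, true_slope=0.5, m=20, seed=0):
    """Return (slope, standard error) for three ways of handling missing y."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = 1.0 + true_slope * x + rng.normal(size=n)
    # Missing at random: the chance that y is missing depends only on x.
    miss = rng.random(n) < 1.0 / (1.0 + np.exp(-x))
    y_obs = np.where(miss, np.nan, y)

    # Overall mean imputation: every missing y gets the observed mean.
    y_mean = y_obs.copy()
    y_mean[miss] = y_obs[~miss].mean()

    # Single imputation: one completed data set, analysed as if fully observed.
    single = ols_slope(x, impute_once(x, y_obs, miss, rng))

    # Multiple imputation: m completed data sets pooled with Rubin's rules.
    fits = np.array([ols_slope(x, impute_once(x, y_obs, miss, rng))
                     for _ in range(m)])
    slopes, ses = fits[:, 0], fits[:, 1]
    within = np.mean(ses ** 2)
    between = np.var(slopes, ddof=1)
    pooled_se = np.sqrt(within + (1.0 + 1.0 / m) * between)

    return {
        "mean_imputation": ols_slope(x, y_mean),
        "single_imputation": single,
        "multiple_imputation": (float(slopes.mean()), float(pooled_se)),
    }

if __name__ == "__main__":
    for method, (slope, se) in simulate().items():
        print(f"{method:>20}: slope={slope:.3f}  se={se:.3f}")
```

In this setup, mean imputation should pull the slope well below the assumed true value of 0.5, while the multiply imputed estimate should stay close to it. The pooled standard error adds the between-imputation variance to the within-imputation variance, which is why it comes out larger than the standard error reported after a single imputation.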

Citing Articles

Transfer learning for predicting of gross domestic product growth based on remittance inflows using RNN-LSTM hybrid model: a case study of The Gambia.

Jallow H, Mwangi R, Gibba A, Imboga H. Front Artif Intell. 2025; 8:1510341.

PMID: 40065783 PMC: 11891165. DOI: 10.3389/frai.2025.1510341.

Integrating epidemiology and genomics data to estimate the prevalence of acquired cysteine drug targets in the U.S. cancer patient population.

Arun A, Liarakos D, Mendiratta G, Kim J, Goshua G, Olson P. Pharmacogenomics J. 2025; 25(1-2):5.

PMID: 40044654 DOI: 10.1038/s41397-025-00364-3.

Predicting tuberculosis drug efficacy in preclinical and clinical models from data.

Goh J, Patel A, Ngara B, Van Wijk R, Strydom N, Wang Q. iScience. 2025; 28(3):111932.

PMID: 40034847 PMC: 11875147. DOI: 10.1016/j.isci.2025.111932.

Structural equation modeling to explore putative causal factors for chronic fatigue in childhood cancer survivors: a DCCSS LATER study.

Penson A, Bucur I, Walraven I, Grootenhuis M, Maurice-Stam H, van der Heiden-van der Loo M J Cancer Surviv. 2025; .

PMID: 40019719 DOI: 10.1007/s11764-024-01738-5.

Challenge of missing data in observational studies: investigating cross-sectional imputation methods for assessing disease activity in axial spondyloarthritis.

Georgiadis S, Pons M, Rasmussen S, Hetland M, Linde L, Di Giuseppe D. RMD Open. 2025; 11(1).

PMID: 39979039 PMC: 11843021. DOI: 10.1136/rmdopen-2024-004844.