
Reducing Information and Selection Bias in EHR-Linked Biobanks Via Genetics-Informed Multiple Imputation and Sample Weighting

Overview
Journal medRxiv
Date 2024 Nov 22
PMID 39574876
Abstract

Electronic health records (EHRs) are valuable for public health and clinical research but are prone to many sources of bias, including missing data and non-probability selection. Missing data in EHRs are complex because they can arise from non-recording, fragmentation of care across health systems, or clinically informative absences. This study explores whether polygenic risk score (PRS)-informed multiple imputation of missing traits, combined with sample weighting, can mitigate missing-data and selection biases when estimating disease-exposure associations. Simulations were conducted under missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) conditions and under different sampling mechanisms. PRS-informed multiple imputation generally showed lower bias, particularly when combined with sample weighting. For example, in biased samples of 10,000 with MAR exposure and outcome data, PRS-informed imputation had lower percent bias (3.8%) and a better coverage rate (0.883) than PRS-uninformed imputation (4.5%; 0.877) and complete-case analysis (10.3%; 0.784) in covariate-adjusted, weighted, multiple imputation scenarios. In a case study using Michigan Genomics Initiative data (n=50,026), PRS-informed imputation aligned more closely with a sample-weighted All of Us-derived benchmark than did analyses that ignored missing data and selection bias. Researchers should consider leveraging genetic data and sample weighting to address biases from missing data and non-probability sampling in biobanks.
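The general idea of combining PRS-informed multiple imputation with sample weighting can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a continuous outcome and exposure with linear models throughout, approximates imputation-parameter uncertainty by bootstrapping the observed rows, uses known inverse-probability-of-selection weights, and pools estimates with Rubin's rules. All function names, variable names, and simulation parameters are hypothetical.

```python
import numpy as np

def weighted_ols(X, y, w):
    """Weighted least squares with a sandwich (HC0) covariance estimate."""
    Xw = X * w[:, None]
    bread = np.linalg.inv(X.T @ Xw)
    beta = bread @ (Xw.T @ y)
    score = Xw * (y - X @ beta)[:, None]
    return beta, bread @ (score.T @ score) @ bread

def impute_once(exposure, predictors, rng):
    """Fill missing exposure values with one stochastic regression draw.

    The imputation model is refit on a bootstrap resample of the observed
    rows so that each imputed data set reflects parameter uncertainty.
    """
    miss = np.isnan(exposure)
    obs_idx = np.flatnonzero(~miss)
    boot = rng.choice(obs_idx, size=obs_idx.size, replace=True)
    X = np.column_stack([np.ones(len(exposure)), predictors])
    coef, *_ = np.linalg.lstsq(X[boot], exposure[boot], rcond=None)
    sigma = np.std(exposure[boot] - X[boot] @ coef, ddof=X.shape[1])
    filled = exposure.copy()
    filled[miss] = X[miss] @ coef + rng.normal(0.0, sigma, miss.sum())
    return filled

def pool_rubin(estimates, variances):
    """Rubin's rules: pooled estimate and total variance across imputations."""
    m = len(estimates)
    within = np.mean(variances)
    between = np.var(estimates, ddof=1)
    return np.mean(estimates), within + (1 + 1 / m) * between

def mi_weighted_estimate(outcome, exposure, prs, covariate, weights, m=20, seed=0):
    """PRS-informed imputation of the exposure, then a weighted outcome model."""
    rng = np.random.default_rng(seed)
    ests, variances = [], []
    for _ in range(m):
        # The PRS enters the imputation model alongside the covariate and outcome.
        predictors = np.column_stack([prs, covariate, outcome])
        filled = impute_once(exposure, predictors, rng)
        X = np.column_stack([np.ones(len(outcome)), filled, covariate])
        beta, cov = weighted_ols(X, outcome, weights)
        ests.append(beta[1])
        variances.append(cov[1, 1])
    return pool_rubin(np.array(ests), np.array(variances))

# Toy simulation (all parameters are illustrative assumptions).
rng = np.random.default_rng(42)
N = 20000
prs = rng.normal(size=N)
cov = rng.normal(size=N)
exposure = 0.6 * prs + 0.3 * cov + rng.normal(size=N)
outcome = 0.5 * exposure + 0.2 * cov + rng.normal(size=N)  # true effect: 0.5

# Non-probability selection that depends on the outcome, with known probabilities.
p_sel = 1 / (1 + np.exp(-(-1.0 + 0.5 * outcome)))
sel = rng.random(N) < p_sel
weights = 1 / p_sel[sel]

# Exposure missing at random (MAR): missingness depends on the observed covariate.
p_miss = 1 / (1 + np.exp(-(-1.0 + 0.8 * cov[sel])))
exp_obs = exposure[sel].copy()
exp_obs[rng.random(sel.sum()) < p_miss] = np.nan

est, var = mi_weighted_estimate(outcome[sel], exp_obs, prs[sel], cov[sel], weights)
print(f"pooled exposure effect: {est:.3f} (SE {np.sqrt(var):.3f})")
```

In this toy setup the PRS is a strong predictor of the missing exposure, which is the situation in which adding it to the imputation model would be expected to help. A disease-exposure analysis with a binary outcome would replace the linear outcome model with a weighted logistic regression, and in practice the selection weights would have to be estimated rather than known.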
