Preprocessing Structured Clinical Data for Predictive Modeling and Decision Support. A Roadmap to Tackle the Challenges

Overview
Publisher Thieme
Date 2016 Dec 8
PMID 27924347
Citations 8
Abstract

Background: Electronic health record (EHR) systems have high potential to improve healthcare delivery and management. Although structured EHR data are available in machine-readable formats, their use for decision support still poses technical challenges for researchers, because the data must be preprocessed and converted into a matrix format. During our research, we observed that the clinical informatics literature does not provide guidance for researchers on how to build this matrix while avoiding potential pitfalls.
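
To make the "matrix format" concrete, the following is a minimal sketch, not the authors' method. It assumes the extracted data arrive as long-format rows of (patient, feature, value); the record contents and the helper name `to_matrix` are hypothetical and chosen only for illustration.

```python
# Hypothetical long-format EHR rows: (patient_id, feature_name, value).
records = [
    ("p1", "heart_rate", 72.0),
    ("p1", "sodium", 140.0),
    ("p2", "heart_rate", 88.0),
    ("p3", "sodium", 135.0),
]


def to_matrix(rows):
    """Pivot long-format rows into a patient-by-feature matrix.

    Cells with no recorded value are filled with None. If a patient has
    several values for one feature, the last row wins here; a real
    pipeline would need an explicit rule for that case.
    """
    patients = sorted({pid for pid, _, _ in rows})
    features = sorted({feat for _, feat, _ in rows})
    cells = {(pid, feat): val for pid, feat, val in rows}
    matrix = [[cells.get((pid, feat)) for feat in features] for pid in patients]
    return patients, features, matrix


patients, features, matrix = to_matrix(records)
for pid, row in zip(patients, matrix):
    print(pid, dict(zip(features, row)))
```

Even in this toy case the matrix contains empty cells and silently resolves duplicate values, which is where the pitfalls mentioned above tend to arise.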

Objectives: This article aims to provide researchers a roadmap of the main technical challenges of preprocessing structured EHR data and possible strategies to overcome them.

Methods: Following the standard data processing stages (extracting database entries, defining features, processing data, assessing feature values, and integrating data elements), which together form the EDPAI framework, we identified the main challenges faced by researchers and reflected on how to address them, drawing on lessons learned from our research experience and on best practices from the related literature. We highlight the main potential sources of error, present strategies to address these challenges, and discuss the implications of these strategies.

Results: Following the EDPAI framework, researchers face five key challenges: (1) gathering and integrating data, (2) identifying and handling different feature types, (3) combining features to handle redundancy and granularity, (4) addressing data missingness, and (5) handling multiple feature values. Strategies to address these challenges include cross-checking identifiers for robust data retrieval and integration; applying clinical knowledge when identifying feature types, addressing redundancy and granularity, and accommodating multiple feature values; and adequately investigating patterns of missingness.
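
Three of these strategies can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the article's implementation: the tables, the choice of birth year as the second identifier, the use of the mean to collapse repeated values, and the helper names `cross_check`, `aggregate`, and `missing_fraction` are all hypothetical.

```python
from statistics import mean

# Hypothetical demographics table keyed by patient identifier.
demographics = {
    "p1": {"birth_year": 1950},
    "p2": {"birth_year": 1962},
    "p3": {"birth_year": 1948},
}

# Hypothetical lab rows that repeat a second identifier (birth year).
labs = [
    {"patient_id": "p1", "birth_year": 1950, "test": "sodium", "value": 140.0},
    {"patient_id": "p1", "birth_year": 1950, "test": "sodium", "value": 136.0},
    {"patient_id": "p2", "birth_year": 1970, "test": "sodium", "value": 138.0},
    {"patient_id": "p9", "birth_year": 1980, "test": "sodium", "value": 141.0},
]


def cross_check(rows, demo):
    """Keep rows whose patient_id exists and whose birth year agrees."""
    accepted, rejected = [], []
    for row in rows:
        person = demo.get(row["patient_id"])
        if person is not None and person["birth_year"] == row["birth_year"]:
            accepted.append(row)
        else:
            rejected.append(row)
    return accepted, rejected


def aggregate(rows, test, how=mean):
    """Collapse multiple values per patient into one (assumed rule: mean)."""
    by_patient = {}
    for row in rows:
        if row["test"] == test:
            by_patient.setdefault(row["patient_id"], []).append(row["value"])
    return {pid: how(vals) for pid, vals in by_patient.items()}


def missing_fraction(feature, patient_ids):
    """Share of patients with no value for the feature."""
    ids = list(patient_ids)
    return sum(1 for pid in ids if pid not in feature) / len(ids)


accepted, rejected = cross_check(labs, demographics)
sodium = aggregate(accepted, "sodium")
print("rejected rows:", [r["patient_id"] for r in rejected])
print("sodium per patient:", sodium)
print("missing fraction:", missing_fraction(sodium, demographics))
```

The aggregation rule is passed as a parameter because the appropriate summary (mean, most recent, worst value) depends on clinical context. Rejected rows are returned rather than discarded so they can be inspected, since records dropped during integration also contribute to the missingness pattern.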

Conclusions: This article contributes to the literature by providing a roadmap to inform the preprocessing of structured EHR data. It may alert researchers to potential pitfalls and to the implications of methodological decisions in handling structured data, helping them avoid biases and realize the benefits of the secondary use of EHR data.

Citing Articles

Conceptualizing bias in EHR data: A case study in performance disparities by demographic subgroups for a pediatric obesity incidence classifier.

Campbell E, Bose S, Masino A PLOS Digit Health. 2024; 3(10):e0000642.

PMID: 39441784 PMC: 11498669. DOI: 10.1371/journal.pdig.0000642.


Automating Electronic Health Record Data Quality Assessment.

Ozonze O, Scott P, Hopgood A J Med Syst. 2023; 47(1):23.

PMID: 36781551 PMC: 9925537. DOI: 10.1007/s10916-022-01892-2.


Developing, implementing and governing artificial intelligence in medicine: a step-by-step approach to prevent an artificial intelligence winter.

van de Sande D, van Genderen M, Smit J, Huiskens J, Visser J, Veen R BMJ Health Care Inform. 2022; 29(1).

PMID: 35185012 PMC: 8860016. DOI: 10.1136/bmjhci-2021-100495.


Subcategorizing EHR diagnosis codes to improve clinical application of machine learning models.

Reimer A, Dai W, Smith B, Schiltz N, Sun J, Koroukian S Int J Med Inform. 2021; 156:104588.

PMID: 34607290 PMC: 8571032. DOI: 10.1016/j.ijmedinf.2021.104588.


Use of machine learning to transform complex standardized nursing care plan data into meaningful research variables: a palliative care exemplar.

Macieira T, Yao Y, Keenan G J Am Med Inform Assoc. 2021; 28(12):2695-2701.

PMID: 34569603 PMC: 8633646. DOI: 10.1093/jamia/ocab205.

