Democratizing EHR Analyses with FIDDLE: a Flexible Data-driven Preprocessing Pipeline for Structured Clinical Data
Objective: In applying machine learning (ML) to electronic health record (EHR) data, many decisions must be made before any ML is applied, and this preprocessing is labor-intensive. As the role of ML in health care grows, there is an increasing need for systematic and reproducible preprocessing techniques for EHR data. Thus, we developed FIDDLE (Flexible Data-Driven Pipeline), an open-source framework that streamlines the preprocessing of data extracted from the EHR.
Materials and Methods: Largely data-driven, FIDDLE systematically transforms structured EHR data into feature vectors, limiting the number of decisions a user must make while incorporating good practices from the literature. To demonstrate its utility and flexibility, we conducted a proof-of-concept experiment in which we applied FIDDLE to 2 publicly available EHR data sets collected from intensive care units: MIMIC-III and the eICU Collaborative Research Database. We trained different ML models to predict 3 clinically important outcomes: in-hospital mortality, acute respiratory failure, and shock. We evaluated the models using the area under the receiver operating characteristic curve (AUROC) and compared their performance with several baselines.
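To make the idea of turning structured EHR data into feature vectors concrete, the following is a minimal sketch in Python. It is not FIDDLE's implementation or API: the `(patient_id, time, variable, value)` record layout, the `bin_events` function, the variable names, and the choice of "latest value plus a presence flag per time bin" are all assumptions made for illustration.

```python
from collections import defaultdict


def bin_events(events, variables, horizon, step):
    """Turn long-format events into one fixed-length vector per patient.

    Hypothetical layout: each event is (patient_id, time, variable, value).
    For every time bin and variable, the vector holds the latest recorded
    value followed by a 0/1 flag saying whether a value was observed.
    """
    n_bins = int(horizon // step)
    # latest[patient_id][(bin_index, variable)] = (time, value)
    latest = defaultdict(dict)
    for pid, t, var, value in events:
        # Ignore unknown variables and events outside the window.
        if var not in variables or not (0 <= t < horizon):
            continue
        key = (int(t // step), var)
        prev = latest[pid].get(key)
        if prev is None or t >= prev[0]:
            latest[pid][key] = (t, value)

    features = {}
    for pid, cells in latest.items():
        vec = []
        for b in range(n_bins):
            for var in variables:
                if (b, var) in cells:
                    vec.extend([cells[(b, var)][1], 1])
                else:
                    vec.extend([0.0, 0])  # placeholder value, flag = missing
        features[pid] = vec
    return features


# Toy records (made up for illustration, not taken from MIMIC-III or eICU).
events = [
    ("p1", 0.5, "heart_rate", 88.0),
    ("p1", 0.9, "heart_rate", 92.0),
    ("p1", 1.2, "spo2", 97.0),
    ("p2", 1.7, "heart_rate", 110.0),
]
vectors = bin_events(events, ["heart_rate", "spo2"], horizon=2.0, step=1.0)
print(vectors["p1"])  # [92.0, 1, 0.0, 0, 0.0, 0, 97.0, 1]
```

Every patient ends up with a vector of the same length regardless of how often each variable was measured, which is what allows a standard ML model to consume the data. A real pipeline would also need to handle categorical values, imputation, and feature selection, none of which this sketch attempts.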
Results: Across tasks and data sets (MIMIC-III and eICU), FIDDLE extracted between 2,528 and 7,403 features. On all tasks, FIDDLE-based models achieved good discriminative performance, with AUROCs of 0.757 to 0.886, comparable to the performance of MIMIC-Extract, a preprocessing pipeline designed specifically for MIMIC-III. Furthermore, our results showed that FIDDLE is generalizable across different prediction times, ML algorithms, and data sets, while being relatively robust to different settings of user-defined arguments.
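For readers less familiar with the metric, AUROC can be read as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it directly from that definition; the `auroc` function and the toy labels and scores are illustrative assumptions, not the evaluation code or data used in the study.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 0/1 outcomes; scores: predicted risks. Ties count as half.
    This pairwise form is quadratic in the number of cases, which is fine
    for illustration but slow for large data sets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 3 of the 4 positive/negative pairs are ordered correctly.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(labels, scores))  # 0.75
```

On this scale, 0.5 corresponds to random ranking and 1.0 to perfect separation, so the reported values of 0.757 to 0.886 indicate that the models rank most positive cases above most negative ones.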
Conclusions: FIDDLE, an open-source preprocessing pipeline, facilitates applying ML to structured EHR data. By accelerating and standardizing labor-intensive preprocessing, FIDDLE can help stimulate progress in building clinically useful ML tools for EHR data.