Democratizing EHR Analyses with FIDDLE: a Flexible Data-driven Preprocessing Pipeline for Structured Clinical Data

Overview

Journal J Am Med Inform Assoc

Publisher Oxford University Press

Specialty Medical Informatics

Date 2020 Oct 11

PMID 33040151

Citations 29

Authors

Shengpu Tang

Parmida Davarmanesh

Yanmeng Song

Danai Koutra

Michael W Sjoding

Jenna Wiens


Abstract

Objective: In applying machine learning (ML) to electronic health record (EHR) data, many decisions must be made before any ML is applied, and this preprocessing is labor-intensive. As the role of ML in health care grows, there is an increasing need for systematic and reproducible preprocessing techniques for EHR data. Thus, we developed FIDDLE (Flexible Data-Driven Pipeline), an open-source framework that streamlines the preprocessing of data extracted from the EHR.

Materials and Methods: Largely data-driven, FIDDLE systematically transforms structured EHR data into feature vectors, limiting the number of decisions a user must make while incorporating good practices from the literature. To demonstrate its utility and flexibility, we conducted a proof-of-concept experiment in which we applied FIDDLE to 2 publicly available EHR data sets collected from intensive care units: MIMIC-III and the eICU Collaborative Research Database. We trained different ML models to predict 3 clinically important outcomes: in-hospital mortality, acute respiratory failure, and shock. We evaluated the models using the area under the receiver operating characteristic curve (AUROC) and compared them with several baselines.
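To make the idea of turning structured EHR rows into feature vectors concrete, here is a minimal sketch of one generic approach. It is not FIDDLE's actual implementation: the row layout `(patient_id, hour, variable, value)`, the function name `to_feature_vectors`, the bin width, and the carry-forward imputation with a missingness indicator are all assumptions made for illustration.

```python
def to_feature_vectors(records, variables, horizon_hours, bin_hours):
    """Turn (patient_id, hour, variable, value) rows into one flat vector per patient.

    For every time bin and variable, the vector holds the latest value seen
    (carried forward from earlier bins, 0.0 if never observed) followed by a
    0/1 flag saying whether the bin itself contained an observation.
    """
    n_bins = horizon_hours // bin_hours
    latest = {}  # (patient, bin index, variable) -> (hour, value)
    patient_ids = set()
    for pid, hour, var, value in records:
        patient_ids.add(pid)
        if var not in variables or not 0 <= hour < n_bins * bin_hours:
            continue  # skip unknown variables and rows outside the window
        key = (pid, int(hour // bin_hours), var)
        if key not in latest or hour >= latest[key][0]:
            latest[key] = (hour, value)

    features = {}
    for pid in sorted(patient_ids):
        vector = []
        last_seen = {var: 0.0 for var in variables}
        for b in range(n_bins):
            for var in variables:
                observed = (pid, b, var) in latest
                if observed:
                    last_seen[var] = latest[(pid, b, var)][1]
                vector.extend([last_seen[var], 1.0 if observed else 0.0])
        features[pid] = vector
    return features


# Hypothetical toy data: two patients, two variables, an 8-hour window in 4-hour bins.
records = [
    ("p1", 0.5, "heart_rate", 88.0),
    ("p1", 1.5, "heart_rate", 92.0),
    ("p1", 5.0, "lactate", 2.1),
    ("p2", 2.0, "heart_rate", 70.0),
]
vectors = to_feature_vectors(
    records, ["heart_rate", "lactate"], horizon_hours=8, bin_hours=4
)
```

Each patient ends up with a vector of length (number of bins) x (number of variables) x 2, so every patient has the same dimensionality regardless of how often they were measured.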

Results: Across tasks, FIDDLE extracted between 2,528 and 7,403 features from MIMIC-III and eICU. On all tasks, FIDDLE-based models achieved good discriminative performance, with AUROCs of 0.757-0.886, comparable to the performance of MIMIC-Extract, a preprocessing pipeline designed specifically for MIMIC-III. Furthermore, our results showed that FIDDLE is generalizable across different prediction times, ML algorithms, and data sets, while being relatively robust to different settings of user-defined arguments.
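For readers unfamiliar with the metric, AUROC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it directly from that definition. It is an illustrative, quadratic-time version with a hypothetical function name and toy data, not the authors' evaluation code.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: 3 of the 4 positive/negative pairs are ordered correctly.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
result = auroc(labels, scores)  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the reported range of 0.757-0.886 should be read.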

Conclusions: FIDDLE, an open-source preprocessing pipeline, facilitates applying ML to structured EHR data. By accelerating and standardizing labor-intensive preprocessing, FIDDLE can help stimulate progress in building clinically useful ML tools for EHR data.

Citing Articles

Learning and diSentangling patient static information from time-series Electronic hEalth Records (STEER).

Liao W, Voldman J PLOS Digit Health. 2024; 3(10):e0000640.

PMID: 39432484 PMC: 11493250. DOI: 10.1371/journal.pdig.0000640.

Automated Fusion of Multimodal Electronic Health Records for Better Medical Predictions.

Cui S, Wang J, Zhong Y, Liu H, Wang T, Ma F Proc SIAM Int Conf Data Min. 2024; 2024:361-369.

PMID: 39399238 PMC: 11469647. DOI: 10.1137/1.9781611978032.41.

An open-source framework for end-to-end analysis of electronic health record data.

Heumos L, Ehmele P, Treis T, Upmeier Zu Belzen J, Roellin E, May L Nat Med. 2024; 30(11):3369-3380.

PMID: 39266748 PMC: 11564094. DOI: 10.1038/s41591-024-03214-0.

A scalable and transparent data pipeline for AI-enabled health data ecosystems.

Namli T, Sinaci A, Gonul S, Herguido C, Garcia-Canadilla P, Munoz A Front Med (Lausanne). 2024; 11:1393123.

PMID: 39139784 PMC: 11321077. DOI: 10.3389/fmed.2024.1393123.

Affordable and real-time antimicrobial resistance prediction from multimodal electronic health records.

Hardan S, Shaaban M, Abdalla J, Yaqub M Sci Rep. 2024; 14(1):16464.

PMID: 39013934 PMC: 11252127. DOI: 10.1038/s41598-024-66812-5.
