Language Models Are an Effective Representation Learning Technique for Electronic Health Record Data

Overview

Journal J Biomed Inform

Publisher Elsevier

Specialty Medical Informatics

Date 2020 Dec 8

PMID 33290879

Citations 35

Authors

Ethan Steinberg

Ken Jung

Jason A Fries

Conor K Corbin

Stephen R Pfohl

Nigam H Shah

Abstract

Widespread adoption of electronic health records (EHRs) has fueled the use of machine learning to build prediction models for various clinical outcomes. However, this process is often constrained by the relatively small number of patient records available for training a model. We demonstrate that patient representation schemes inspired by techniques in natural language processing can increase the accuracy of clinical prediction models by transferring information learned from the entire patient population to the task of training a specific model, where only a subset of the population is relevant. Such patient representation schemes enable a 3.5% mean improvement in AUROC on five prediction tasks compared to standard baselines, with the average improvement rising to 19% when only a small number of patient records are available for training the clinical prediction model.
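The abstract reports its gains in AUROC (area under the receiver operating characteristic curve). As a minimal sketch of what that metric measures, the Python snippet below computes AUROC as the fraction of (positive, negative) patient pairs in which the positive patient receives the higher risk score, with ties counted as half. The function name `auroc`, the toy labels, and both score lists are hypothetical illustrations and are not taken from the paper.

```python
def auroc(labels, scores):
    """Return the AUROC: the probability that a randomly chosen positive
    example is scored higher than a randomly chosen negative example
    (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (not from the paper): 1 = outcome occurred, 0 = it did not.
labels = [0, 0, 1, 1]
baseline_scores = [0.1, 0.4, 0.35, 0.8]    # one positive is ranked below a negative
pretrained_scores = [0.1, 0.3, 0.35, 0.8]  # every positive outranks every negative

print(auroc(labels, baseline_scores))
print(auroc(labels, pretrained_scores))
```

The pairwise loop is quadratic in the number of patients, which is fine for illustration; a rank-based formulation would be the usual choice for large cohorts.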

Citing Articles

A machine learning approach to leveraging electronic health records for enhanced omics analysis.

Mataraso S, Espinosa C, Seong D, Reincke S, Berson E, Reiss J. Nat Mach Intell. 2025; 7(2):293-306.

PMID: 40008295 PMC: 11847705. DOI: 10.1038/s42256-024-00974-9.

A roadmap to implementing machine learning in healthcare: from concept to practice.

Yan A, Guo L, Inoue J, Arciniegas S, Vettese E, Wolochacz A. Front Digit Health. 2025; 7:1462751.

PMID: 39906065 PMC: 11788154. DOI: 10.3389/fdgth.2025.1462751.

Developing a Research Center for Artificial Intelligence in Medicine.

Langlotz C, Kim J, Shah N, Lungren M, Larson D, Datta S. Mayo Clin Proc Digit Health. 2025; 2(4):677-686.

PMID: 39802660 PMC: 11720458. DOI: 10.1016/j.mcpdig.2024.07.005.

Unified Clinical Vocabulary Embeddings for Advancing Precision Medicine.

Johnson R, Gottlieb U, Shaham G, Eisen L, Waxman J, Devons-Sberro S. medRxiv. 2024.

PMID: 39677476 PMC: 11643188. DOI: 10.1101/2024.12.03.24318322.

Debiasing large language models: research opportunities.

Yogarajan V, Dobbie G, Keegan T. J R Soc N Z. 2024; 55(2):372-395.

PMID: 39677375 PMC: 11639098. DOI: 10.1080/03036758.2024.2398567.
