PMID: 39545787

Generating Pregnant Patient Biological Profiles by Deconvoluting Clinical Records with Electronic Health Record Foundation Models

Abstract

Translational biology posits a strong bi-directional link between clinical phenotypes and a patient's biological profile. By leveraging this link, pre-existing clinical information can be efficiently deconvoluted into biological profiles. However, traditional computational tools are limited in their ability to resolve this link because paired clinical-biological training datasets are relatively small and tabular clinical data are high-dimensional and sparse. Here, we use state-of-the-art foundation models (FMs) for electronic health record (EHR) data to generate proteomics profiles of pregnant patients, without the cost and effort of running large-scale traditional omics studies. We show that FM-derived representations of a patient's EHR data, coupled with a fully connected neural network prediction head, can generate expression levels for 206 blood proteins. These proteins were enriched for developmental pathways, whereas proteins that could not be generated from EHR data were enriched for metabolic pathways. Finally, we show a proteomic signature of gestational diabetes that includes proteins with both established and novel links to the condition. These results showcase the power of FM-derived EHR representations to efficiently generate biological states of pregnant patients. This capability could advance disease understanding and therapeutic development by offering a cost-effective, time-efficient, and less invasive alternative to traditional methods of generating proteomics data.
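To make the architecture described in the abstract concrete, the following is a minimal sketch of a fully connected prediction head that maps one patient embedding to 206 protein expression values. It is not the authors' implementation: the embedding size, hidden width, single hidden layer, ReLU activation, and random (untrained) weights are all assumptions chosen for illustration, and the embedding itself is a random stand-in for the output of an EHR foundation model. Only the output dimension (206) comes from the abstract.

```python
import math
import random

N_PROTEINS = 206   # number of generated proteins reported in the abstract
EMBED_DIM = 32     # assumed embedding size (hypothetical, not from the source)
HIDDEN_DIM = 16    # assumed hidden-layer width (hypothetical, not from the source)


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) for one fully connected layer with random weights."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def linear(x, layer):
    """Apply an affine map: one dot product plus bias per output unit."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(x):
    """Element-wise rectified linear activation."""
    return [max(0.0, v) for v in x]


def prediction_head(embedding, layers):
    """Map one patient embedding to N_PROTEINS predicted expression levels."""
    hidden = relu(linear(embedding, layers[0]))
    return linear(hidden, layers[1])


rng = random.Random(0)
layers = [
    init_layer(EMBED_DIM, HIDDEN_DIM, rng),
    init_layer(HIDDEN_DIM, N_PROTEINS, rng),
]

# Random stand-in for the foundation model's representation of one patient's EHR.
patient_embedding = [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]

predicted_levels = prediction_head(patient_embedding, layers)
print(len(predicted_levels))
```

In practice the head would be trained on paired EHR and proteomics data (for example with a regression loss per protein); the sketch shows only the forward pass, so its outputs are meaningless until the weights are fitted.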

Citing Articles

A machine learning approach to leveraging electronic health records for enhanced omics analysis.

Mataraso S, Espinosa C, Seong D, Reincke S, Berson E, Reiss J. Nat Mach Intell. 2025; 7(2):293-306.

PMID: 40008295 PMC: 11847705. DOI: 10.1038/s42256-024-00974-9.


Advancing neonatal health: the promise and challenges of universal genome sequencing in newborn screening.

Stevenson D, Wong R, Reiss J, Shaw G, Aghaeepour N, Mahzarnia A. Pediatr Res. 2025; .

PMID: 39833347 DOI: 10.1038/s41390-025-03874-9.
