
Synthesize High-dimensional Longitudinal Electronic Health Records Via Hierarchical Autoregressive Language Model

Overview
Journal: Nat Commun
Specialty: Biology
Date: 2023 Aug 31
PMID: 37652934
Abstract

Synthetic electronic health records (EHRs) that are both realistic and privacy-preserving offer an alternative to real EHRs for machine learning (ML) and statistical analysis. However, generating high-fidelity EHR data in its original, high-dimensional form poses challenges for existing methods. We propose the Hierarchical Autoregressive Language mOdel (HALO) for generating longitudinal, high-dimensional EHRs that preserve the statistical properties of real EHRs and can be used to train accurate ML models without privacy concerns. HALO generates a probability density function over medical codes, clinical visits, and patient records, which allows it to generate realistic EHR data without requiring variable selection or aggregation. Extensive experiments demonstrated that HALO can generate high-fidelity data whose high-dimensional disease code probabilities closely mirror those of real EHR data (R² correlation above 0.9). HALO also enhances the accuracy of predictive modeling and enables downstream ML models to attain accuracy similar to that of models trained on genuine data.
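
The fidelity claim above compares per-code probabilities in real and synthetic data. The following is a minimal sketch of one way such a comparison could be computed. It is not the authors' evaluation code. It assumes the reported correlation is a coefficient of determination (R²), that each patient is a list of visits and each visit is a list of codes, and it uses a tiny toy dataset. The function names `code_prevalence` and `r_squared` and the example codes are hypothetical and introduced only for illustration.

```python
from collections import Counter


def code_prevalence(patients, vocab):
    """Fraction of patients whose record contains each code at least once."""
    counts = Counter()
    for visits in patients:
        seen = set()
        for visit in visits:
            seen.update(visit)
        counts.update(seen)  # each patient contributes at most 1 per code
    n = len(patients)
    return [counts[code] / n for code in vocab]


def r_squared(real, synth):
    """Coefficient of determination, treating synthetic values as predictions of real ones."""
    mean_real = sum(real) / len(real)
    ss_res = sum((r - s) ** 2 for r, s in zip(real, synth))
    ss_tot = sum((r - mean_real) ** 2 for r in real)
    return 1.0 - ss_res / ss_tot


# Toy data (assumption): each patient is a list of visits, each visit a list of codes.
real_patients = [
    [["I10", "E11"], ["I10"]],
    [["I10"]],
    [["I10", "E11", "N18"]],
    [["I10"]],
]
synthetic_patients = [
    [["I10"], ["E11"]],
    [["I10", "N18"]],
    [["E11"]],
    [["I10"]],
]
vocab = ["I10", "E11", "N18"]

real_prev = code_prevalence(real_patients, vocab)
synth_prev = code_prevalence(synthetic_patients, vocab)
score = r_squared(real_prev, synth_prev)
print(f"R^2 between real and synthetic code prevalence: {score:.3f}")
```

A value near 1 means that the synthetic data reproduces the real per-code probabilities closely. The same comparison could be repeated for per-visit probabilities or for co-occurring code pairs, but those choices are also assumptions here.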

Citing Articles

Improving medical machine learning models with generative balancing for equity and excellence.

Theodorou B, Danek B, Tummala V, Kumar S, Malin B, Sun J. NPJ Digit Med. 2025; 8(1):100.

PMID: 39953146 PMC: 11828851. DOI: 10.1038/s41746-025-01438-z.


Synthetic Health Data: Real Ethical Promise and Peril.

Susser D, Schiff D, Gerke S, Cabrera L, Cohen I, Doerr M. Hastings Cent Rep. 2024; 54(5):8-13.

PMID: 39487776 PMC: 11555762. DOI: 10.1002/hast.4911.


Reliable generation of privacy-preserving synthetic electronic health record time series via diffusion models.

Tian M, Chen B, Guo A, Jiang S, Zhang A. J Am Med Inform Assoc. 2024; 31(11):2529-2539.

PMID: 39222376 PMC: 11491591. DOI: 10.1093/jamia/ocae229.


On the evaluation of synthetic longitudinal electronic health records.

Achterberg J, Haas M, Spruit M. BMC Med Res Methodol. 2024; 24(1):181.

PMID: 39143466 PMC: 11323671. DOI: 10.1186/s12874-024-02304-4.
