Word Embeddings Trained on Published Case Reports Are Lightweight, Effective for Clinical Tasks, and Free of Protected Health Information

Overview
Journal J Biomed Inform
Publisher Elsevier
Date 2021 Dec 17
PMID 34920127
Citations 4
Abstract

Objective: Quantify tradeoffs in performance, reproducibility, and resource demands across several strategies for developing clinically relevant word embeddings.

Materials and Methods: We trained separate embeddings on all full-text manuscripts in the PubMed Central (PMC) Open Access subset, the case reports therein, the English Wikipedia corpus, the Medical Information Mart for Intensive Care (MIMIC) III dataset, and all notes in the University of Pennsylvania Health System (UPHS) electronic health record. We tested the embeddings on six clinically relevant tasks, including mortality prediction and de-identification, and assessed performance on those two tasks using the scaled Brier score (SBS) and the proportion of notes successfully de-identified, respectively.

Results: Embeddings from UPHS notes best predicted mortality (SBS 0.30, 95% CI 0.15 to 0.45), while Wikipedia embeddings performed worst (SBS 0.12, 95% CI -0.05 to 0.28). Wikipedia embeddings de-identified notes most consistently (78% of notes), and the full PMC corpus embeddings least consistently (48%). Across all six tasks, the full PMC corpus demonstrated the most consistent performance, and the Wikipedia corpus the least. Corpus size ranged from 49 million tokens (PMC case reports) to 10 billion (UPHS).

Discussion: Embeddings trained on published case reports performed at least as well as embeddings trained on other corpora in most tasks, and clinical corpora consistently outperformed non-clinical corpora. No single corpus produced a strictly dominant set of embeddings across all tasks, so the optimal training corpus depends on the intended use.

Conclusion: Embeddings trained on published case reports performed comparably on most clinical tasks to embeddings trained on larger corpora. Open access corpora allow training of clinically relevant, effective, and reproducible embeddings.
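
Note on the scaled Brier score: the abstract reports mortality prediction as an SBS but does not define it. The sketch below uses a commonly cited definition, SBS = 1 - BS / BS_ref, where BS is the mean squared difference between predicted probabilities and observed outcomes and BS_ref is the Brier score of a reference model that predicts the observed event rate for every case. Whether the authors computed it exactly this way is an assumption; the function names are illustrative and do not come from the paper.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if len(y_true) != len(y_prob) or not y_true:
        raise ValueError("y_true and y_prob must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def scaled_brier_score(y_true, y_prob):
    """Scaled Brier score under the assumed definition 1 - BS / BS_ref.

    BS_ref is the Brier score of a reference model that assigns the observed
    event rate to every case. A score of 1 means perfect predictions, 0 means
    no better than the reference, and negative values mean worse than it.
    """
    event_rate = sum(y_true) / len(y_true)
    bs_ref = brier_score(y_true, [event_rate] * len(y_true))
    if bs_ref == 0:
        # All outcomes are identical, so the reference model is already perfect
        # and the ratio is undefined.
        raise ValueError("scaled Brier score is undefined when all outcomes are identical")
    return 1.0 - brier_score(y_true, y_prob) / bs_ref


if __name__ == "__main__":
    # Toy data, not taken from the study.
    outcomes = [0, 0, 1, 1]
    predictions = [0.1, 0.2, 0.8, 0.9]
    print(round(scaled_brier_score(outcomes, predictions), 3))
```

Under this definition, the reported SBS of 0.30 for UPHS embeddings would correspond to a Brier score 30% lower than that of the event-rate reference, and the Wikipedia interval crossing zero would mean those embeddings could not be distinguished from the reference.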

Citing Articles

Year 2021: COVID-19, Information Extraction and BERTization among the Hottest Topics in Medical Natural Language Processing.

Grabar N, Grouin C Yearb Med Inform. 2022; 31(1):254-260.

PMID: 36463883 PMC: 9719758. DOI: 10.1055/s-0042-1742547.


Medical terminology-based computing system: a lightweight post-processing solution for out-of-vocabulary multi-word terms.

Saeed N, Naveed H Front Mol Biosci. 2022; 9:928530.

PMID: 36032678 PMC: 9411640. DOI: 10.3389/fmolb.2022.928530.


Development and validation of a prediction model for actionable aspects of frailty in the text of clinicians' encounter notes.

Martin J, Crane-Droesch A, Lapite F, Puhl J, Kmiec T, Silvestri J J Am Med Inform Assoc. 2021; 29(1):109-119.

PMID: 34791302 PMC: 8714261. DOI: 10.1093/jamia/ocab248.


Lexicon Development for COVID-19-related Concepts Using Open-source Word Embedding Sources: An Intrinsic and Extrinsic Evaluation.

Parikh S, Davoudi A, Yu S, Giraldo C, Schriver E, Mowery D JMIR Med Inform. 2021; 9(2):e21679.

PMID: 33544689 PMC: 7901592. DOI: 10.2196/21679.
