
Corpus Domain Effects on Distributional Semantic Modeling of Medical Terms

Overview
Journal: Bioinformatics
Specialty: Biology
Date: 2016 Aug 18
PMID: 27531100
Citations: 42
Abstract

Motivation: Automatically quantifying semantic similarity and relatedness between clinical terms is an important aspect of text mining from electronic health records, which are increasingly recognized as valuable sources of phenotypic information for clinical genomics and bioinformatics research. A key obstacle to the development of semantic relatedness measures is the limited availability of large quantities of clinical text to researchers and developers outside of major medical centers. Text from general English and biomedical literature is freely available; however, its validity as a substitute for clinical-domain text in representing the semantics of clinical terms remains to be demonstrated.

Results: We constructed neural network representations of clinical terms found in a publicly available benchmark dataset manually labeled for semantic similarity and relatedness. Similarity and relatedness measures computed from text corpora in three domains (Clinical Notes, PubMed Central articles and Wikipedia) were compared using the benchmark as reference. We found that measures computed from the full text of biomedical articles in the PubMed Central repository (rho = 0.62 for similarity and 0.58 for relatedness) are on par with measures computed from clinical reports (rho = 0.60 for similarity and 0.57 for relatedness). We also evaluated the use of neural network-based relatedness measures for query expansion in a clinical document retrieval task and in a biomedical term word sense disambiguation task. We found that, with some limitations, biomedical articles may be used in lieu of clinical reports to represent the semantics of clinical terms, and that distributional semantic methods are useful for clinical and biomedical natural language processing applications.
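
The evaluation described above, scoring term pairs with a vector-based measure and correlating those scores with human ratings, can be illustrated with a minimal sketch. It assumes cosine similarity as the vector measure and Spearman's rank correlation as the rho statistic; the abstract does not spell out either choice. The term vectors, term pairs, and ratings below are invented toy values, not data from the study or its benchmark, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """Rank values from 1..n; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy 3-dimensional term vectors (invented for illustration only).
vectors = {
    "heart failure": [0.9, 0.1, 0.0],
    "cardiomyopathy": [0.8, 0.2, 0.1],
    "fracture": [0.0, 0.9, 0.3],
    "cast": [0.2, 0.7, 0.5],
    "aspirin": [0.3, 0.1, 0.9],
}

# Toy term pairs with invented human relatedness ratings.
rated_pairs = [
    ("heart failure", "cardiomyopathy", 3.5),
    ("fracture", "cast", 3.0),
    ("heart failure", "aspirin", 2.0),
    ("cardiomyopathy", "fracture", 1.0),
]

model_scores = [cosine(vectors[a], vectors[b]) for a, b, _ in rated_pairs]
human_scores = [rating for _, _, rating in rated_pairs]
rho = spearman(model_scores, human_scores)
print(f"Spearman rho on toy data: {rho:.2f}")
```

In this toy data the model scores rank the four pairs in the same order as the invented ratings, so the correlation is perfect; on a real benchmark the agreement would be partial, as the rho values reported above indicate. Under the same assumptions, the vectors would be trained separately on each corpus and the comparison repeated per corpus.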

Availability and Implementation: The software and reference standards used in this study to evaluate semantic similarity and relatedness measures are publicly available as detailed in the article.

Contact: pakh0002@umn.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Citing Articles

Clinical Relatedness and Stability of vigiVec Semantic Vector Representations of Adverse Events and Drugs in Pharmacovigilance.

Erlanson N, Felix China J, Taavola H, Noren G. Drug Saf. 2025; 48(4):401-413.

PMID: 39833656 PMC: 11903574. DOI: 10.1007/s40264-024-01509-2.


Biomedical Text Classification Using Augmented Word Representation Based on Distributional and Relational Contexts.

Parwez M, Fazil M, Arif M, Nafis M, Auwul M. Comput Intell Neurosci. 2024; 2023:2989791.

PMID: 39262497 PMC: 11390191. DOI: 10.1155/2023/2989791.


Extracting Complementary and Integrative Health Approaches in Electronic Health Records.

Zhou H, Silverman G, Niu Z, Silverman J, Evans R, Austin R. J Healthc Inform Res. 2023; 7(3):277-290.

PMID: 37637720 PMC: 10449701. DOI: 10.1007/s41666-023-00137-2.


Implications of resampling data to address the class imbalance problem (IRCIP): an evaluation of impact on performance between classification algorithms in medical data.

Welvaars K, Oosterhoff J, van den Bekerom M, Doornberg J, van Haarst E. JAMIA Open. 2023; 6(2):ooad033.

PMID: 37266187 PMC: 10232287. DOI: 10.1093/jamiaopen/ooad033.


Validating the representation of distance between infarct diseases using word embedding.

Yokokawa D, Noda K, Yanagita Y, Uehara T, Ohira Y, Shikino K. BMC Med Inform Decis Mak. 2022; 22(1):322.

PMID: 36476486 PMC: 9730570. DOI: 10.1186/s12911-022-02061-8.

