Expansion of Medical Vocabularies Using Distributional Semantics on Japanese Patient Blogs

Overview

Journal J Biomed Semantics

Publisher BioMed Central

Specialties Biology, Biomedical Engineering

Date 2016 Sep 28

PMID 27671202

Citations 7

Authors

Magnus Ahltorp

Maria Skeppstedt

Shiho Kitajima

Aron Henriksson

Rafal Rzepka

Kenji Araki

Abstract

Background: Research on medical vocabulary expansion from large corpora has primarily been conducted using text written in English or similar languages, due to the limited availability of large biomedical corpora in most languages. Medical vocabularies are, however, also essential for text mining from corpora written in languages other than English and belonging to a variety of medical genres. The aim of this study was therefore to evaluate medical vocabulary expansion using a corpus very different from those previously used, in terms of grammar and orthography as well as text genre. This was carried out by applying a method based on distributional semantics to the task of extracting medical vocabulary terms from a large corpus of Japanese patient blogs.

Methods: Distributional properties of terms were modelled with random indexing, followed by agglomerative hierarchical clustering of 3 × 100 seed terms from existing vocabularies, belonging to three semantic categories: Medical Finding, Pharmaceutical Drug and Body Part. Candidates for new vocabulary terms were suggested by automatically extracting unknown terms close to the centroids of the created clusters. The method was evaluated for its ability to retrieve the remaining n terms in existing medical vocabularies.
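The pipeline described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the vector dimensionality (200), the number of non-zero entries (8) and the toy English tokens (standing in for segmented Japanese blog text) are all assumptions made for illustration. The agglomerative clustering step is also simplified away here, so all seed terms are treated as a single cluster with one centroid.

```python
import math
import random
from collections import defaultdict


def index_vector(dim, nonzero, rng):
    """Sparse ternary index vector: a few random +1/-1 entries, the rest 0."""
    vec = [0.0] * dim
    for i, pos in enumerate(rng.sample(range(dim), nonzero)):
        vec[pos] = 1.0 if i % 2 == 0 else -1.0
    return vec


def build_context_vectors(sentences, window=1, dim=200, nonzero=8, seed=0):
    """Random indexing: a term's context vector is the sum of the index
    vectors of the tokens within `window` positions on either side."""
    rng = random.Random(seed)
    index = {}
    context = defaultdict(lambda: [0.0] * dim)
    for tokens in sentences:
        for tok in tokens:
            if tok not in index:
                index[tok] = index_vector(dim, nonzero, rng)
        for i, tok in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                neighbour = index[tokens[j]]
                target = context[tok]
                for k in range(dim):
                    target[k] += neighbour[k]
    return dict(context)


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(context, seeds, top_k):
    """Rank non-seed terms by similarity to the centroid of the seed terms."""
    seed_vectors = [context[s] for s in seeds if s in context]
    if not seed_vectors:
        raise ValueError("no seed term occurs in the corpus")
    centre = [sum(col) / len(seed_vectors) for col in zip(*seed_vectors)]
    scored = [(cosine(vec, centre), term)
              for term, vec in context.items() if term not in seeds]
    scored.sort(reverse=True)
    return [term for _, term in scored[:top_k]]


# Toy corpus (hypothetical); window=1 corresponds to a 1+1 context window.
sentences = [
    ["took", "aspirin", "today"],
    ["took", "ibuprofen", "today"],
    ["took", "paracetamol", "today"],
    ["my", "knee", "hurts"],
    ["my", "elbow", "hurts"],
]
vectors = build_context_vectors(sentences, window=1)
print(rank_candidates(vectors, {"aspirin", "ibuprofen"}, top_k=3))
```

In this toy corpus the unknown term that shares its immediate neighbours with the seed terms ends up closest to the seed centroid. A fuller version would first group the seeds with agglomerative hierarchical clustering and rank candidates against each cluster centroid separately.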

Results: Removing case particles and using a context window size of 1+1 was a successful strategy for Medical Finding and Pharmaceutical Drug, while retaining case particles and using a window size of 8+8 was better for Body Part. For a candidate list of length 10n, the use of different cluster sizes affected the result for Pharmaceutical Drug, while the effect was only marginal for the other two categories. For a list of the top n candidates for Body Part, however, clusters with a size of up to two terms were slightly more useful than larger clusters. For Pharmaceutical Drug, the best settings resulted in a recall of 25 % for a candidate list of the top n terms and a recall of 68 % for the top 10n. For a list of the top 10n candidates, the second best results were obtained for Medical Finding: a recall of 58 %, compared to 46 % for Body Part. Only taking the top n candidates into account, however, resulted in a recall of 23 % for Body Part, compared to 16 % for Medical Finding.
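The recall figures above are measured against the n held-out vocabulary terms, for candidate lists of length n and 10n. A minimal sketch of that measure follows; the function name `recall_at_k` and the toy data are hypothetical and are not taken from the study.

```python
def recall_at_k(ranked_candidates, held_out_terms, k):
    """Fraction of held-out vocabulary terms found among the top k candidates."""
    held_out = set(held_out_terms)
    if not held_out:
        raise ValueError("held_out_terms must not be empty")
    found = held_out.intersection(ranked_candidates[:k])
    return len(found) / len(held_out)


# Hypothetical example: n = 4 held-out terms, a ranked list of candidates.
held_out = ["a", "b", "c", "d"]
ranked = ["a", "x", "b", "y", "z", "c", "w"]
n = len(held_out)

top_n_recall = recall_at_k(ranked, held_out, n)         # list of length n
top_10n_recall = recall_at_k(ranked, held_out, 10 * n)  # list of length 10n
print(top_n_recall, top_10n_recall)
```

Because the list of length n contains exactly as many candidates as there are held-out terms, recall and precision coincide at that cut-off. At 10n, recall can only stay the same or increase.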

Conclusions: Different settings for corpus pre-processing, window sizes and cluster sizes were suitable for different semantic categories and for different lengths of candidate lists, showing the need to adapt parameters, not only to the language and text genre used, but also to the semantic category for which the vocabulary is to be expanded. The results show, however, that the investigated choices for pre-processing and parameter settings were successful, and that a Japanese blog corpus, which in many ways differs from those used in previous studies, can be a useful resource for medical vocabulary expansion.

Citing Articles

An Alternative Application of Natural Language Processing to Express a Characteristic Feature of Diseases in Japanese Medical Records.

Yamanouchi Y, Nakamura T, Ikeda T, Usuku K Methods Inf Med. 2023; 62(3-04):110-118.

PMID: 36809794 PMC: 10462427. DOI: 10.1055/a-2039-3773.

MedLexSp - a medical lexicon for Spanish medical natural language processing.

Campillos-Llanos L J Biomed Semantics. 2023; 14(1):2.

PMID: 36732862 PMC: 9892682. DOI: 10.1186/s13326-022-00281-5.

Affective Cognition of Students' Autonomous Learning in College English Teaching Based on Deep Learning.

Zhang D Front Psychol. 2022; 12:808434.

PMID: 35126258 PMC: 8808963. DOI: 10.3389/fpsyg.2021.808434.

Learning unsupervised contextual representations for medical synonym discovery.

Schumacher E, Dredze M JAMIA Open. 2020; 2(4):538-546.

PMID: 32025651 PMC: 6994012. DOI: 10.1093/jamiaopen/ooz057.

Clinical Natural Language Processing in languages other than English: opportunities and challenges.

Neveol A, Dalianis H, Velupillai S, Savova G, Zweigenbaum P J Biomed Semantics. 2018; 9(1):12.

PMID: 29602312 PMC: 5877394. DOI: 10.1186/s13326-018-0179-8.
