
BioWordVec, improving Biomedical Word Embeddings with Subword Information and MeSH

Overview
Journal Sci Data
Specialty Science
Date 2019 May 12
PMID 31076572
Citations 109
Abstract

Distributed word representations have become an essential foundation for biomedical natural language processing (BioNLP), text mining and information retrieval. Word embeddings are traditionally computed at the word level from a large corpus of unlabeled text, ignoring both the information present in the internal structure of words and any information available in domain-specific structured resources such as ontologies. However, such information holds potential for greatly improving the quality of the word representation, as suggested by some recent studies in the general domain. Here we present BioWordVec: an open set of biomedical word vectors/embeddings that combines subword information from unlabeled biomedical text with a widely used biomedical controlled vocabulary called Medical Subject Headings (MeSH). We assess both the validity and utility of our generated word embeddings over multiple NLP tasks in the biomedical domain. Our benchmarking results demonstrate that our word embeddings can result in significantly improved performance over the previous state of the art in those challenging tasks.
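To illustrate what "subword information" can mean in practice, the following is a minimal sketch, not the BioWordVec implementation. It assumes a character n-gram scheme in which a word's vector is the average of the vectors of its n-grams; the abstract does not specify the mechanism. The function names, the n-gram range, the bucket count and the pseudo-random stand-in vectors are all hypothetical; a real model would learn the n-gram vectors from a corpus.

```python
import math
import random
import zlib


def char_ngrams(word, n_min=3, n_max=6):
    """Return character n-grams of the word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    ]


def ngram_vector(gram, dim=8, buckets=100_000):
    """Stand-in for a trained n-gram vector (assumption: hashed buckets).

    The n-gram is hashed to a bucket, and the bucket seeds a pseudo-random
    vector so that the same n-gram always maps to the same values.
    """
    bucket = zlib.crc32(gram.encode("utf-8")) % buckets
    rng = random.Random(bucket)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def word_vector(word, dim=8):
    """Average the vectors of all character n-grams of the word."""
    grams = char_ngrams(word)
    if not grams:
        return [0.0] * dim
    total = [0.0] * dim
    for gram in grams:
        for k, value in enumerate(ngram_vector(gram, dim)):
            total[k] += value
    return [x / len(grams) for x in total]


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Words sharing a suffix such as "itis" share n-grams, so part of
    # their vectors comes from the same components.
    shared = set(char_ngrams("hepatitis")) & set(char_ngrams("nephritis"))
    print(sorted(shared))
    print(cosine(word_vector("hepatitis"), word_vector("nephritis")))
```

Because the vector is composed from n-grams, a word never seen during training still receives a representation, which is one reason subword schemes are attractive for biomedical text with its many rare, morphologically regular terms.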

Citing Articles

Ontology-guided machine learning outperforms zero-shot foundation models for cardiac ultrasound text reports.

Subramaniam S, Rizvi S, Ramesh R, Sehgal V, Gurusamy B, Arif H. Sci Rep. 2025; 15(1):5456.

PMID: 39953053 PMC: 11828978. DOI: 10.1038/s41598-024-83540-y.


Reliability-enhanced data cleaning in biomedical machine learning using inductive conformal prediction.

Zhan X, Xu Q, Zheng Y, Lu G, Gevaert O. PLoS Comput Biol. 2025; 21(2):e1012803.

PMID: 39946419 PMC: 11870354. DOI: 10.1371/journal.pcbi.1012803.


Predicting accrual success for better clinical trial resource allocation.

Ma S, Wang Y, Wagner J, Johnson S, Pakhomov S, Aliferis C. Sci Rep. 2025; 15(1):3879.

PMID: 39890973 PMC: 11785987. DOI: 10.1038/s41598-025-88400-x.


Clinical Relatedness and Stability of vigiVec Semantic Vector Representations of Adverse Events and Drugs in Pharmacovigilance.

Erlanson N, Felix China J, Taavola H, Noren G. Drug Saf. 2025; 48(4):401-413.

PMID: 39833656 PMC: 11903574. DOI: 10.1007/s40264-024-01509-2.


Characterization and automated classification of sentences in the biomedical literature: a case study for biocuration of gene expression and protein kinase activity.

Raciti D, Van Auken K, Arnaboldi V, Tabone C, Muller H, Sternberg P. bioRxiv. 2025.

PMID: 39829858 PMC: 11741306. DOI: 10.1101/2025.01.06.631539.

