Deep Learning with Word Embeddings Improves Biomedical Named Entity Recognition

Overview
Journal: Bioinformatics
Specialty: Biology
Date: 2017 Sep 9
PMID: 28881963
Citations: 106
Abstract

Motivation: Text mining has become an important tool for biomedical research. The most fundamental text-mining task is the recognition of biomedical named entities (NER), such as genes, chemicals and diseases. Current NER methods rely on pre-defined features which try to capture the specific surface properties of entity types, properties of the typical local context, background knowledge, and linguistic information. State-of-the-art tools are entity-specific, as dictionaries and empirically optimal feature sets differ between entity types, which makes their development costly. Furthermore, features are often optimized for a specific gold standard corpus, which makes extrapolation of quality measures difficult.
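To make "pre-defined features" concrete, the following is a minimal sketch of the kind of hand-crafted token features a feature-based NER tagger might use (surface properties plus local context). The function name `token_features` and the specific features are hypothetical illustrations, not the feature set of any tool evaluated in the article.

```python
def token_features(tokens, i):
    """Hypothetical hand-crafted features for the token at position i.

    Illustrative only: real entity-specific tools typically add
    dictionaries and many more tuned features.
    """
    tok = tokens[i]
    return {
        # Surface properties of the token itself
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "has_digit": any(c.isdigit() for c in tok),
        "has_hyphen": "-" in tok,
        "suffix3": tok[-3:].lower(),
        # Local context: neighbouring tokens, with boundary markers
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


sentence = ["Mutations", "in", "BRCA1", "cause", "cancer"]
feats = token_features(sentence, 2)
print(feats)
```

Each such feature has to be designed and tuned per entity type, which is the development cost the abstract refers to; an embedding-based model instead learns its token representations from data.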

Results: We show that a completely generic method based on deep learning and statistical word embeddings [called long short-term memory network-conditional random field (LSTM-CRF)] outperforms state-of-the-art entity-specific NER tools, often by a large margin. To this end, we compared the performance of LSTM-CRF on 33 data sets covering five different entity classes with that of best-of-class NER tools and an entity-agnostic CRF implementation. On average, the F1-score of LSTM-CRF is 5% above that of the baselines, mostly due to a sharp increase in recall.
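For readers unfamiliar with how precision, recall and F1-score are computed for NER, here is a minimal sketch of entity-level scoring. It assumes BIO-style tags and exact-match evaluation (an entity counts as correct only if both its span and its type match); the abstract does not state the matching criterion, and the function names are hypothetical.

```python
def extract_entities(tags):
    """Return the set of (start, end, type) spans in a BIO tag sequence."""
    spans = set()
    start, etype = None, None
    # A trailing "O" sentinel closes any entity still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def precision_recall_f1(gold_tags, pred_tags):
    """Entity-level precision, recall and F1 under exact span/type match."""
    gold = extract_entities(gold_tags)
    pred = extract_entities(pred_tags)
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


gold = ["B-Gene", "I-Gene", "O", "B-Disease", "O"]
pred = ["B-Gene", "I-Gene", "O", "O", "O"]
print(precision_recall_f1(gold, pred))
```

In this toy example the prediction finds one of the two gold entities and makes no false predictions, so precision is 1.0 and recall is 0.5. A tagger that misses fewer entities raises recall, and with it F1, which is the pattern the abstract reports.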

Availability and implementation: The source code for LSTM-CRF is available at https://github.com/glample/tagger, and links to the corpora are available at https://corposaurus.github.io/corpora/.

Contact: habibima@informatik.hu-berlin.de.

Citing Articles

Large Language Models in Biomedical and Health Informatics: A Review with Bibliometric Analysis.

Yu H, Fan L, Li L, Zhou J, Ma Z, Xian L. J Healthc Inform Res. 2024; 8(4):658-711.

PMID: 39463859 PMC: 11499577. DOI: 10.1007/s41666-024-00171-8.


AI-Based Knowledge Extraction from the Bioprinting Literature for Identifying Technology Trends.

Bonatti A, Chiarello F, Vozzi G, De Maria C. 3D Print Addit Manuf. 2024; 11(4):1495-1509.

PMID: 39360130 PMC: 11443122. DOI: 10.1089/3dp.2022.0316.


Multi-head CRF classifier for biomedical multi-class named entity recognition on Spanish clinical notes.

Jonker R, Almeida T, Antunes R, Almeida J, Matos S. Database (Oxford). 2024; 2024.

PMID: 39083461 PMC: 11290360. DOI: 10.1093/database/baae068.


Towards discovery: an end-to-end system for uncovering novel biomedical relations.

Almeida T, Jonker R, Antunes R, Almeida J, Matos S. Database (Oxford). 2024; 2024.

PMID: 38994795 PMC: 11240158. DOI: 10.1093/database/baae057.


Automated Information Extraction from Thyroid Operation Narrative: A Comparative Study of GPT-4 and Fine-tuned KoELECTRA.

Jang D, Park H, Son J, Hwang H, Kim S, Choi J. AMIA Jt Summits Transl Sci Proc. 2024; 2024:249-257.

PMID: 38827054 PMC: 11141853.

