Deep Learning with Word Embeddings Improves Biomedical Named Entity Recognition
Motivation: Text mining has become an important tool for biomedical research. The most fundamental text-mining task is the recognition of biomedical named entities (NER), such as genes, chemicals and diseases. Current NER methods rely on pre-defined features which try to capture the specific surface properties of entity types, properties of the typical local context, background knowledge, and linguistic information. State-of-the-art tools are entity-specific, as dictionaries and empirically optimal feature sets differ between entity types, which makes their development costly. Furthermore, features are often optimized for a specific gold standard corpus, which makes extrapolation of quality measures difficult.
Results: We show that a completely generic method based on deep learning and statistical word embeddings [called long short-term memory network-conditional random field (LSTM-CRF)] outperforms state-of-the-art entity-specific NER tools, often by a large margin. To this end, we compared the performance of LSTM-CRF on 33 data sets covering five different entity classes with that of best-of-class NER tools and an entity-agnostic CRF implementation. On average, the F1-score of LSTM-CRF is 5% above that of the baselines, mostly due to a sharp increase in recall.
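To illustrate how an increase in recall can raise the F1-score, the following minimal sketch computes precision, recall and F1 over sets of entity annotations. It assumes exact-match evaluation, where a prediction counts only if its span and type are identical to a gold annotation; the abstract does not state the matching criterion. The function name and the toy annotations are invented for illustration and are not taken from the study.

```python
def precision_recall_f1(gold, predicted):
    """Entity-level precision, recall and F1 under exact matching (an assumption)."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy annotations as (start, end, entity_type); hypothetical data, not from the paper.
gold = {(0, 5, "Gene"), (10, 17, "Chemical"), (20, 28, "Disease"), (30, 34, "Gene")}

# A cautious tagger: everything it predicts is right, but it misses half the entities.
cautious = {(0, 5, "Gene"), (10, 17, "Chemical")}

# A tagger that finds more entities at the cost of one false positive.
broader = {(0, 5, "Gene"), (10, 17, "Chemical"), (20, 28, "Disease"), (40, 45, "Gene")}

for name, predictions in [("cautious", cautious), ("broader", broader)]:
    p, r, f1 = precision_recall_f1(gold, predictions)
    print(f"{name}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

In this toy case the second tagger has lower precision but higher recall, and its F1-score is higher, because F1 is the harmonic mean of the two and is pulled down by whichever value is smaller.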
Availability and Implementation: The source code for LSTM-CRF is available at https://github.com/glample/tagger and the links to the corpora are available at https://corposaurus.github.io/corpora/.
Contact: habibima@informatik.hu-berlin.de.