Document-level Attention-based BiLSTM-CRF Incorporating Disease Dictionary for Disease Named Entity Recognition
Background: Disease named entity recognition (NER) plays an important role in biomedical research. Several challenges remain open; among these, the identification of rare diseases and complex disease names and the problem of tagging inconsistency (i.e., the same entity being tagged differently within a document) are attracting substantial research attention.
Methods: We propose a new neural network method named Dic-Att-BiLSTM-CRF (DABLC) for disease NER. DABLC applies an efficient exact string matching method to match disease entities against a disease dictionary constructed from the Disease Ontology. It then builds a dictionary attention layer that combines the dictionary matching results with a document-level attention mechanism. Finally, a bidirectional long short-term memory network with a conditional random field (BiLSTM-CRF) is combined with the dictionary attention layer to perform disease NER.
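The dictionary matching step can be illustrated with a minimal sketch. The abstract does not say which exact string matching algorithm DABLC uses, so the greedy longest-match lookup over a hash set below is an assumption chosen for brevity. The function names (`build_index`, `match_dictionary`), the `B-Dic`/`I-Dic`/`O` tag names, and the three-term toy dictionary are all hypothetical; a real dictionary would be built from Disease Ontology terms.

```python
from typing import List, Set, Tuple

TermIndex = Set[Tuple[str, ...]]


def build_index(terms: List[str]) -> Tuple[TermIndex, int]:
    """Lowercase and whitespace-tokenize each term; return the set of
    token tuples and the length (in tokens) of the longest term."""
    index = {tuple(term.lower().split()) for term in terms if term.strip()}
    max_len = max((len(t) for t in index), default=0)
    return index, max_len


def match_dictionary(tokens: List[str], index: TermIndex, max_len: int) -> List[str]:
    """Tag each token as B-Dic / I-Dic / O using greedy longest exact match.

    Assumption: case-insensitive matching on whitespace tokens. This is an
    illustrative stand-in, not the matching method used by the authors.
    """
    lowered = [t.lower() for t in tokens]
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first, then shrink.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in index:
                matched = n
                break
        if matched:
            tags[i] = "B-Dic"
            for j in range(i + 1, i + matched):
                tags[j] = "I-Dic"
            i += matched
        else:
            i += 1
    return tags


# Toy dictionary (hypothetical entries, for illustration only).
index, max_len = build_index(["breast cancer", "cancer", "ataxia telangiectasia"])
sentence = "Familial breast cancer and ataxia telangiectasia".split()
dict_tags = match_dictionary(sentence, index, max_len)
```

In a model along the lines described above, per-token tags like these would be one possible input to the dictionary attention layer; how DABLC actually encodes the match results is not specified in the abstract.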
Results: Extensive experiments are conducted on two widely used corpora: the NCBI disease corpus and the BioCreative V CDR corpus. Each test is applied to 10 executions of each model, with a 95% confidence interval. DABLC achieves the highest F1 scores (NCBI: Precision = 0.883, Recall = 0.89, F1 = 0.886; BioCreative V CDR: Precision = 0.891, Recall = 0.875, F1 = 0.883), outperforming the state-of-the-art methods.
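As a quick consistency check, the reported F1 scores can be recomputed from the reported precision and recall using the standard harmonic-mean definition, F1 = 2PR / (P + R). The sketch below assumes that definition; the helper name `f1_score` and the `reported` dictionary are illustrative and not taken from the paper. Small differences in the last digit are possible if the published figures are averages over runs.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision, recall, and F1 as reported in the abstract.
reported = {
    "NCBI": (0.883, 0.89, 0.886),
    "BioCreative V CDR": (0.891, 0.875, 0.883),
}

recomputed = {
    corpus: f1_score(p, r) for corpus, (p, r, _) in reported.items()
}

for corpus, (_, _, f1) in reported.items():
    print(f"{corpus}: reported F1 = {f1:.3f}, recomputed F1 = {recomputed[corpus]:.3f}")
```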
Conclusion: DABLC combines the advantages of external dictionary resources and deep attention neural networks. This aids the identification of rare diseases and complex disease names and reduces the impact of tagging inconsistency. Special disease NER and deep learning models that handle long sentences are worthwhile directions for future work.