Using Rule-based Natural Language Processing to Improve Disease Normalization in Biomedical Text

Overview
Date 2012 Oct 9
PMID 23043124
Citations 37
Abstract

Background and Objective: For computers to extract useful information from unstructured text, a concept normalization system is needed to link relevant concepts in the text to sources that contain further information about them. Popular concept normalization tools in the biomedical field are dictionary-based. In this study we investigate the usefulness of natural language processing (NLP) as an adjunct to dictionary-based concept normalization.

Methods: We compared the performance of two biomedical concept normalization systems, MetaMap and Peregrine, on the Arizona Disease Corpus, with and without the use of a rule-based NLP module. Performance was assessed for exact and inexact boundary matching of the system annotations with those of the gold standard and for concept identifier matching.
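
The evaluation described above can be illustrated with a minimal sketch. It is not the authors' evaluation code: the `Annotation` type, the overlap-based definition of an inexact match, the choice to score concept identifiers per annotation, and the placeholder identifiers `ID_A`, `ID_B`, and `ID_C` are all assumptions made for illustration.

```python
from typing import NamedTuple, Optional, Sequence


class Annotation(NamedTuple):
    """A text span with character offsets; `end` is exclusive (hypothetical type)."""
    start: int
    end: int
    concept_id: Optional[str] = None


def boundaries_match(sys_ann: Annotation, gold_ann: Annotation, exact: bool) -> bool:
    """Exact: identical offsets. Inexact (assumed here): any character overlap."""
    if exact:
        return sys_ann.start == gold_ann.start and sys_ann.end == gold_ann.end
    return sys_ann.start < gold_ann.end and gold_ann.start < sys_ann.end


def f_score(system: Sequence[Annotation], gold: Sequence[Annotation],
            exact: bool = True, check_ids: bool = False) -> float:
    """Harmonic mean of precision and recall over annotations."""
    def hit(s: Annotation, g: Annotation) -> bool:
        if not boundaries_match(s, g, exact):
            return False
        return (not check_ids) or s.concept_id == g.concept_id

    # Precision counts system annotations that match some gold annotation;
    # recall counts gold annotations that are matched by some system annotation.
    matched_sys = sum(any(hit(s, g) for g in gold) for s in system)
    matched_gold = sum(any(hit(s, g) for s in system) for g in gold)
    precision = matched_sys / len(system) if system else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data: identifiers are placeholders, not real vocabulary codes.
gold = [Annotation(0, 13, "ID_A"), Annotation(20, 28, "ID_B")]
system = [
    Annotation(0, 13, "ID_A"),   # exact boundaries, correct identifier
    Annotation(22, 28, "ID_C"),  # overlapping boundaries, wrong identifier
    Annotation(40, 45, "ID_C"),  # spurious annotation
]

print(f_score(system, gold, exact=True))
print(f_score(system, gold, exact=False))
print(f_score(system, gold, exact=False, check_ids=True))
```

On this toy data, relaxing the boundary criterion raises the score because the second system annotation overlaps a gold span without matching it exactly, while requiring identifier agreement lowers it again.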

Results: Without the NLP module, MetaMap and Peregrine attained F-scores of 61.0% and 63.9%, respectively, for exact boundary matching, and 55.1% and 56.9% for concept identifier matching. With the aid of the NLP module, the F-scores of MetaMap and Peregrine improved to 73.3% and 78.0% for exact boundary matching, and to 66.2% and 69.8% for concept identifier matching. With inexact boundary matching, performance increased further to 85.5% and 85.4% for boundary matching, and to 73.6% and 73.3% for concept identifier matching.
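
As a quick check on the size of the improvement, the absolute F-score gains can be computed from the figures reported above. This sketch assumes that the with-module boundary scores of 73.3% and 78.0% refer to exact boundary matching; the dictionary layout and the `gain` helper are illustrative only.

```python
# (system, criterion) -> (F-score without NLP module, F-score with NLP module), in percent
reported = {
    ("MetaMap", "boundary"): (61.0, 73.3),
    ("Peregrine", "boundary"): (63.9, 78.0),
    ("MetaMap", "concept identifier"): (55.1, 66.2),
    ("Peregrine", "concept identifier"): (56.9, 69.8),
}


def gain(before: float, after: float) -> float:
    """Absolute gain in percentage points, rounded to one decimal place."""
    return round(after - before, 1)


gains = {key: gain(*pair) for key, pair in reported.items()}

for (system_name, criterion), points in gains.items():
    print(f"{system_name}, {criterion} matching: +{points} points")
```

Under that assumption, the gains are between 11.1 and 14.1 percentage points, with Peregrine gaining slightly more than MetaMap on both criteria.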

Conclusions: We have shown the added value of NLP for the recognition and normalization of diseases with MetaMap and Peregrine. The NLP module is general and can be applied in combination with any concept normalization system. Whether its use for concept types other than disease is equally advantageous remains to be investigated.

Citing Articles

PromptLink: Leveraging Large Language Models for Cross-Source Biomedical Concept Linking.

Xie Y, Lu J, Ho J, Nahab F, Hu X, Yang C. Int ACM SIGIR Conf Res Dev Inf Retr. 2025; 2024:2589-2593.

PMID: 40018364 PMC: 11867735. DOI: 10.1145/3626772.3657904.


Mapping vaccine names in clinical trials to vaccine ontology using cascaded fine-tuned domain-specific language models.

Li J, Li Y, Pan Y, Guo J, Sun Z, Li F. J Biomed Semantics. 2024; 15(1):14.

PMID: 39123237 PMC: 11316402. DOI: 10.1186/s13326-024-00318-x.


Thyroid Ultrasound Appropriateness Identification Through Natural Language Processing of Electronic Health Records.

Jacome C, Segura Torres D, Fan J, Loor-Torres R, Duran M, Al Zahidy M. Mayo Clin Proc Digit Health. 2024; 2(1):67-74.

PMID: 38501072 PMC: 10947349. DOI: 10.1016/j.mcpdig.2024.01.001.


Mapping Vaccine Names in Clinical Trials to Vaccine Ontology using Cascaded Fine-Tuned Domain-Specific Language Models.

Li J, Li Y, Pan Y, Guo J, Sun Z, Li F. Res Sq. 2023.

PMID: 37841880 PMC: 10571639. DOI: 10.21203/rs.3.rs-3362256/v1.


B-LBConA: a medical entity disambiguation model based on Bio-LinkBERT and context-aware mechanism.

Yang S, Zhang P, Che C, Zhong Z. BMC Bioinformatics. 2023; 24(1):97.

PMID: 36927359 PMC: 10021986. DOI: 10.1186/s12859-023-05209-z.

