Identifying and Extracting Rare Diseases and Their Phenotypes with Large Language Models

Overview
Date 2024 Apr 29
PMID 38681753
Abstract

Purpose: Phenotyping is critical for informing rare disease diagnosis and treatment, but disease phenotypes are often embedded in unstructured text. While natural language processing (NLP) can automate extraction, a major bottleneck is developing annotated corpora. Recently, prompt learning with large language models (LLMs) has been shown to yield generalizable results with no annotated samples (zero-shot) or only a few (few-shot), but no prior work has explored this for rare diseases. Our work is the first to study prompt learning for identifying and extracting rare disease phenotypes in the zero- and few-shot settings.

Methods: We compared the performance of prompt learning with ChatGPT and fine-tuning with BioClinicalBERT. We engineered novel prompts for ChatGPT to identify and extract rare diseases and their phenotypes (e.g., diseases, symptoms, and signs), established a benchmark for evaluating its performance, and conducted an in-depth error analysis.
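To make the zero-shot versus one-shot distinction concrete, the following is a minimal sketch of a conversational, sentence-based prompt builder. It is an illustration only: the wording of the prompt, the `build_prompt` function, and the `ENTITY_TYPES` tuple are hypothetical and are not the prompts engineered in the study.

```python
from typing import Optional, Tuple

# Entity types named in the abstract; the exact label strings are an assumption.
ENTITY_TYPES = ("rare disease", "disease", "symptom", "sign")


def build_prompt(
    text: str,
    entity_type: str,
    example: Optional[Tuple[str, str]] = None,
) -> str:
    """Build a sentence-style extraction prompt (hypothetical wording).

    `example` is an optional (example_text, example_answer) pair.
    Leaving it as None gives a zero-shot prompt; supplying one pair
    gives a one-shot prompt.
    """
    if entity_type not in ENTITY_TYPES:
        raise ValueError(f"unknown entity type: {entity_type!r}")

    parts = [
        f"Please list every {entity_type} mentioned in the text below. "
        "Copy each mention exactly as written, one per line, "
        "and answer 'none' if there are none."
    ]
    if example is not None:
        example_text, example_answer = example
        parts.append(f"Example text: {example_text}\nExample answer: {example_answer}")
    parts.append(f"Text: {text}\nAnswer:")
    return "\n\n".join(parts)


# Zero-shot: instruction plus the input text only.
zero_shot = build_prompt("The infant presented with hypotonia.", "sign")

# One-shot: the same instruction with a single annotated example prepended.
one_shot = build_prompt(
    "The infant presented with hypotonia.",
    "sign",
    example=("She had marked hepatomegaly on examination.", "hepatomegaly"),
)
```

The resulting string would be sent to a chat model of choice; the call itself is omitted here because the abstract does not specify how the model was accessed.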

Results: Overall, fine-tuning BioClinicalBERT resulted in higher performance (F1 of 0.689) than ChatGPT (F1 of 0.472 and 0.610 in the zero- and few-shot settings, respectively). However, ChatGPT achieved higher accuracy for rare diseases and signs in the one-shot setting (F1 of 0.778 and 0.725, respectively). Conversational, sentence-based prompts generally achieved higher accuracy than prompts formatted as structured lists.
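For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The sketch below shows how it could be computed for one entity type, assuming exact-match comparison of case-normalized mention strings; the abstract does not state the matching criterion used in the study, so that choice and the `precision_recall_f1` helper are assumptions made for illustration.

```python
from typing import Iterable, Tuple


def precision_recall_f1(gold: Iterable[str], predicted: Iterable[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact string matching.

    Mentions are lowercased and stripped before comparison (an assumption,
    not the study's stated protocol). Duplicates are collapsed into sets.
    """
    gold_set = {g.strip().lower() for g in gold}
    pred_set = {p.strip().lower() for p in predicted}

    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data, invented for illustration: two of three predictions are correct,
# and two of three gold mentions are recovered.
gold_mentions = ["hypotonia", "seizures", "ataxia"]
predicted_mentions = ["Hypotonia", "seizures", "fever"]
p, r, f1 = precision_recall_f1(gold_mentions, predicted_mentions)
```

Here precision and recall are both 2/3, so F1 is also 2/3. Relaxed or partial-span matching would generally give higher scores than exact matching on the same predictions.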

Conclusion: Prompt learning using ChatGPT has the potential to match or outperform fine-tuning BioClinicalBERT at extracting rare diseases and signs with just one annotated sample. Given its accessibility, ChatGPT could be leveraged to extract these entities without relying on a large, annotated corpus. While LLMs can support rare disease phenotyping, researchers should critically evaluate model outputs to ensure phenotyping accuracy.

Citing Articles

An Automatic and End-to-End System for Rare Disease Knowledge Graph Construction Based on Ontology-Enhanced Large Language Models: Development Study.

Cao L, Sun J, Cross A. JMIR Med Inform. 2024; 12:e60665.

PMID: 39693482 PMC: 11683654. DOI: 10.2196/60665.


SEETrials: Leveraging large language models for safety and efficacy extraction in oncology clinical trials.

Lee K, Paek H, Huang L, Hilton C, Datta S, Higashi J. Inform Med Unlocked. 2024; 50.

PMID: 39493413 PMC: 11530223. DOI: 10.1016/j.imu.2024.101589.


Large Language Models in Biomedical and Health Informatics: A Review with Bibliometric Analysis.

Yu H, Fan L, Li L, Zhou J, Ma Z, Xian L. J Healthc Inform Res. 2024; 8(4):658-711.

PMID: 39463859 PMC: 11499577. DOI: 10.1007/s41666-024-00171-8.


A hybrid framework with large language models for rare disease phenotyping.

Wu J, Dong H, Li Z, Wang H, Li R, Patra A. BMC Med Inform Decis Mak. 2024; 24(1):289.

PMID: 39375687 PMC: 11460004. DOI: 10.1186/s12911-024-02698-7.
