Identifying and Extracting Rare Diseases and Their Phenotypes with Large Language Models
Overview
Purpose: Phenotyping is critical for informing rare disease diagnosis and treatment, but disease phenotypes are often embedded in unstructured text. While natural language processing (NLP) can automate extraction, a major bottleneck is developing annotated corpora. Recently, prompt learning with large language models (LLMs) has been shown to produce generalizable results with no annotated samples (zero-shot) or only a few (few-shot), but no prior study has explored this for rare diseases. Our work is the first to study prompt learning for identifying and extracting rare disease phenotypes in the zero- and few-shot settings.
Methods: We compared the performance of prompt learning with ChatGPT and fine-tuning with BioClinicalBERT. We engineered novel prompts for ChatGPT to identify and extract rare diseases and their phenotypes (e.g., diseases, symptoms, and signs), established a benchmark for evaluating its performance, and conducted an in-depth error analysis.
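To make the zero-shot versus few-shot distinction concrete, the following is a minimal sketch of how a sentence-style extraction prompt might be assembled. The function name `build_prompt`, the prompt wording, and the example sentences are illustrative assumptions; they are not the prompts engineered in this study.

```python
from typing import List, Optional, Tuple


def build_prompt(
    text: str,
    entity_type: str,
    examples: Optional[List[Tuple[str, List[str]]]] = None,
) -> str:
    """Assemble an extraction prompt (hypothetical wording, not the study's).

    With no examples the prompt is zero-shot; each (text, entities) pair in
    `examples` adds one annotated demonstration (one-shot, few-shot, ...).
    """
    parts = [
        f"Extract every {entity_type} mentioned in the passage. "
        "Answer with a comma-separated list, or 'none' if there are none."
    ]
    # Annotated demonstrations, if any, come before the passage to label.
    for ex_text, ex_entities in examples or []:
        answer = ", ".join(ex_entities) if ex_entities else "none"
        parts.append(f"Text: {ex_text}\nAnswer: {answer}")
    # The passage to annotate is left with an open answer slot.
    parts.append(f"Text: {text}\nAnswer:")
    return "\n\n".join(parts)


# Zero-shot: instruction plus the passage only.
zero_shot = build_prompt("The patient presented with ataxia.", "sign")

# One-shot: a single annotated sample precedes the passage.
one_shot = build_prompt(
    "The patient presented with ataxia.",
    "sign",
    examples=[("Examination revealed hepatomegaly.", ["hepatomegaly"])],
)
```

The only difference between the two settings in this sketch is the number of annotated demonstrations placed before the passage to be labeled.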
Results: Overall, fine-tuning BioClinicalBERT achieved higher performance (F1 of 0.689) than ChatGPT (F1 of 0.472 and 0.610 in the zero- and few-shot settings, respectively). However, ChatGPT achieved higher accuracy for rare diseases and signs in the one-shot setting (F1 of 0.778 and 0.725, respectively). Conversational, sentence-based prompts generally achieved higher accuracy than structured-list prompts.
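For readers unfamiliar with the F1 scores reported above, the sketch below shows one common way to compute precision, recall, and F1 for extracted entities. It assumes exact matching on (entity type, text) pairs; the matching criterion actually used in this study is not stated in the abstract, and the function name `prf1` and the sample entities are illustrative.

```python
from typing import Iterable, Tuple

Entity = Tuple[str, str]  # (entity_type, normalized_text); an assumed representation


def prf1(gold: Iterable[Entity], predicted: Iterable[Entity]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact-match scoring."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up annotations for illustration only.
gold = [("rare_disease", "cystic fibrosis"), ("sign", "clubbing"), ("symptom", "cough")]
pred = [("rare_disease", "cystic fibrosis"), ("sign", "clubbing"), ("sign", "fever")]
precision, recall, f1 = prf1(gold, pred)
```

In this made-up example two of the three predictions match the gold annotations and two of the three gold entities are recovered, so precision, recall, and F1 all equal 2/3.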
Conclusion: Prompt learning using ChatGPT has the potential to match or outperform fine-tuning BioClinicalBERT at extracting rare diseases and signs with just one annotated sample. Given its accessibility, ChatGPT could be leveraged to extract these entities without relying on a large, annotated corpus. While LLMs can support rare disease phenotyping, researchers should critically evaluate model outputs to ensure phenotyping accuracy.