Toward Expert-level Medical Question Answering with Large Language Models
Large language models (LLMs) have shown promise in medical question answering, with Med-PaLM being the first to exceed a 'passing' score in United States Medical Licensing Examination style questions. However, challenges remain in long-form medical question answering and handling real-world workflows. Here, we present Med-PaLM 2, which bridges these gaps with a combination of base LLM improvements, medical domain fine-tuning and new strategies for improving reasoning and grounding through ensemble refinement and chain of retrieval. Med-PaLM 2 scores up to 86.5% on the MedQA dataset, improving upon Med-PaLM by over 19%, and demonstrates dramatic performance increases across MedMCQA, PubMedQA and MMLU clinical topics datasets. Our detailed human evaluations framework shows that physicians prefer Med-PaLM 2 answers to those from other physicians on eight of nine clinical axes. Med-PaLM 2 also demonstrates significant improvements over its predecessor across all evaluation metrics, particularly on new adversarial datasets designed to probe LLM limitations (P < 0.001). In a pilot study using real-world medical questions, specialists preferred Med-PaLM 2 answers to generalist physician answers 65% of the time. While specialist answers were still preferred overall, both specialists and generalists rated Med-PaLM 2 to be as safe as physician answers, demonstrating its growing potential in real-world medical applications.