PMID: 39779926

Toward Expert-level Medical Question Answering with Large Language Models

Abstract

Large language models (LLMs) have shown promise in medical question answering, with Med-PaLM being the first to exceed a 'passing' score on United States Medical Licensing Examination-style questions. However, challenges remain in long-form medical question answering and in handling real-world workflows. Here, we present Med-PaLM 2, which bridges these gaps with a combination of base LLM improvements, medical domain fine-tuning and new strategies for improving reasoning and grounding through ensemble refinement and chain of retrieval. Med-PaLM 2 scores up to 86.5% on the MedQA dataset, improving upon Med-PaLM by over 19%, and demonstrates dramatic performance increases across the MedMCQA, PubMedQA and MMLU clinical topics datasets. Our detailed human evaluation framework shows that physicians prefer Med-PaLM 2 answers to those from other physicians on eight of nine clinical axes. Med-PaLM 2 also demonstrates significant improvements over its predecessor across all evaluation metrics, particularly on new adversarial datasets designed to probe LLM limitations (P < 0.001). In a pilot study using real-world medical questions, specialists preferred Med-PaLM 2 answers to generalist physician answers 65% of the time. While specialist answers were still preferred overall, both specialists and generalists rated Med-PaLM 2 answers to be as safe as physician answers, demonstrating its growing potential in real-world medical applications.

Citing Articles

Advantages and limitations of large language models for antibiotic prescribing and antimicrobial stewardship.

Giacobbe D, Marelli C, La Manna B, Padua D, Malva A, Guastavino S. NPJ Antimicrob Resist. 2025; 3(1):14.

PMID: 40016394 PMC: 11868396. DOI: 10.1038/s44259-025-00084-5.


Home Environment as a Therapeutic Target for Prevention and Treatment of Chronic Diseases: Delivering Restorative Living Spaces, Patient Education and Self-Care by Bridging Biophilic Design, E-Commerce and Digital Health Technologies.

Huntsman D, Bulaj G. Int J Environ Res Public Health. 2025; 22(2).

PMID: 40003451 PMC: 11855921. DOI: 10.3390/ijerph22020225.


The foundational capabilities of large language models in predicting postoperative risks using clinical notes.

Alba C, Xue B, Abraham J, Kannampallil T, Lu C. NPJ Digit Med. 2025; 8(1):95.

PMID: 39934379 PMC: 11814325. DOI: 10.1038/s41746-025-01489-2.


Mental-LLM: Leveraging Large Language Models for Mental Health Prediction via Online Text Data.

Xu X, Yao B, Dong Y, Gabriel S, Yu H, Hendler J. Proc ACM Interact Mob Wearable Ubiquitous Technol. 2025; 8(1).

PMID: 39925940 PMC: 11806945. DOI: 10.1145/3643540.


Evaluation of GPT-4 concordance with North American Spine Society guidelines for lumbar fusion surgery.

Khoylyan A, Salvato J, Vazquez F, Girgis M, Tang A, Chen T. N Am Spine Soc J. 2025; 21:100580.

PMID: 39911377 PMC: 11795085. DOI: 10.1016/j.xnsj.2024.100580.

