An Evaluation Framework for Clinical Use of Large Language Models in Patient Interaction Tasks

Overview

Journal: Nat Med

Specialties: General Medicine, Molecular Biology

Date: 2025 Jan 3

PMID: 39747685

Authors

Shreya Johri

Jaehwan Jeong

Benjamin A Tran

Daniel I Schlessinger

Shannon Wongvibulsin

Leandra A Barnes

Hong-Yu Zhou

Zhuo Ran Cai

Eliezer M Van Allen

David Kim

Roxana Daneshjou

Pranav Rajpurkar

Abstract

The integration of large language models (LLMs) into clinical diagnostics has the potential to transform doctor-patient interactions. However, the readiness of these models for real-world clinical application remains inadequately tested. This paper introduces the Conversational Reasoning Assessment Framework for Testing in Medicine (CRAFT-MD) approach for evaluating clinical LLMs. Unlike traditional methods that rely on structured medical examinations, CRAFT-MD focuses on natural dialogues, using simulated artificial intelligence agents to interact with LLMs in a controlled environment. We applied CRAFT-MD to assess the diagnostic capabilities of GPT-4, GPT-3.5, Mistral and LLaMA-2-7b across 12 medical specialties. Our experiments revealed critical insights into the limitations of current LLMs in terms of clinical conversational reasoning, history-taking and diagnostic accuracy. These limitations also persisted when analyzing multimodal conversational and visual assessment capabilities of GPT-4V. We propose a comprehensive set of recommendations for future evaluations of clinical LLMs based on our empirical findings. These recommendations emphasize realistic doctor-patient conversations, comprehensive history-taking, open-ended questioning and using a combination of automated and expert evaluations. The introduction of CRAFT-MD marks an advancement in testing of clinical LLMs, aiming to ensure that these models augment medical practice effectively and ethically.
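The conversational setup described in the abstract can be illustrated with a minimal sketch. It assumes a simple turn-taking loop between a clinician model under test and a simulated patient, followed by an automated check of the final answer. All names here (`Case`, `run_conversation`, `grade`, the scripted agents, and the "Final diagnosis:" convention) are hypothetical and chosen for illustration; they are not taken from the CRAFT-MD implementation, and the scripted functions stand in for real LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical types: a dialogue is a list of (speaker, text) turns, and an
# agent is any function that maps the dialogue so far to its next utterance.
Dialogue = List[Tuple[str, str]]
Agent = Callable[[Dialogue], str]


@dataclass
class Case:
    vignette: str   # case details available only to the simulated patient
    diagnosis: str  # reference answer used by the automated grader


def run_conversation(clinician: Agent, patient: Agent, max_turns: int = 10) -> Dialogue:
    """Alternate clinician and patient turns until a diagnosis is offered."""
    dialogue: Dialogue = []
    for _ in range(max_turns):
        utterance = clinician(dialogue)
        dialogue.append(("clinician", utterance))
        # Assumed convention: the clinician ends with "Final diagnosis: ...".
        if utterance.lower().startswith("final diagnosis:"):
            break
        dialogue.append(("patient", patient(dialogue)))
    return dialogue


def grade(dialogue: Dialogue, case: Case) -> bool:
    """Toy automated grader: case-insensitive exact match on the diagnosis."""
    if not dialogue:
        return False
    speaker, text = dialogue[-1]
    if speaker != "clinician" or not text.lower().startswith("final diagnosis:"):
        return False
    answer = text.split(":", 1)[1].strip().lower()
    return answer == case.diagnosis.lower()


def make_scripted_patient(case: Case) -> Agent:
    """Stand-in for a patient agent: always answers from the vignette."""
    def patient(dialogue: Dialogue) -> str:
        return case.vignette
    return patient


def scripted_clinician(dialogue: Dialogue) -> str:
    """Stand-in for the model under test: two questions, then a diagnosis."""
    asked = sum(1 for speaker, _ in dialogue if speaker == "clinician")
    if asked < 2:
        return "Can you tell me more about your symptoms?"
    return "Final diagnosis: eczema"


if __name__ == "__main__":
    demo_case = Case(vignette="Itchy, dry rash on both elbows for months.",
                     diagnosis="Eczema")
    demo_dialogue = run_conversation(scripted_clinician,
                                     make_scripted_patient(demo_case))
    for speaker, text in demo_dialogue:
        print(f"{speaker}: {text}")
    print("correct:", grade(demo_dialogue, demo_case))
```

In a real evaluation the exact-match grader would be too strict for open-ended answers, which is one reason the abstract recommends combining automated and expert evaluation.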

Citing Articles

Red teaming ChatGPT in medicine to yield real-world insights on model behavior.

Chang C, Farah H, Gui H, Rezaei S, Bou-Khalil C, Park Y. NPJ Digit Med. 2025; 8(1):149.

PMID: 40055532 PMC: 11889229. DOI: 10.1038/s41746-025-01542-0.
