PMID: 39747685

An Evaluation Framework for Clinical Use of Large Language Models in Patient Interaction Tasks

Abstract

The integration of large language models (LLMs) into clinical diagnostics has the potential to transform doctor-patient interactions. However, the readiness of these models for real-world clinical application remains inadequately tested. This paper introduces the Conversational Reasoning Assessment Framework for Testing in Medicine (CRAFT-MD) approach for evaluating clinical LLMs. Unlike traditional methods that rely on structured medical examinations, CRAFT-MD focuses on natural dialogues, using simulated artificial intelligence agents to interact with LLMs in a controlled environment. We applied CRAFT-MD to assess the diagnostic capabilities of GPT-4, GPT-3.5, Mistral and LLaMA-2-7b across 12 medical specialties. Our experiments revealed critical insights into the limitations of current LLMs in terms of clinical conversational reasoning, history-taking and diagnostic accuracy. These limitations also persisted when analyzing multimodal conversational and visual assessment capabilities of GPT-4V. We propose a comprehensive set of recommendations for future evaluations of clinical LLMs based on our empirical findings. These recommendations emphasize realistic doctor-patient conversations, comprehensive history-taking, open-ended questioning and using a combination of automated and expert evaluations. The introduction of CRAFT-MD marks an advancement in testing of clinical LLMs, aiming to ensure that these models augment medical practice effectively and ethically.

Citing Articles

Red teaming ChatGPT in medicine to yield real-world insights on model behavior.

Chang C, Farah H, Gui H, Rezaei S, Bou-Khalil C, Park Y. NPJ Digit Med. 2025; 8(1):149.

PMID: 40055532 PMC: 11889229. DOI: 10.1038/s41746-025-01542-0.
