Large Language Models' Responses to Liver Cancer Surveillance, Diagnosis, and Management Questions: Accuracy, Reliability, Readability

Overview

Journal Abdom Radiol (NY)

Publisher Springer

Specialties Gastroenterology
Radiology

Date 2024 Aug 1

PMID 39088019

Authors

Jennie J Cao

Daniel H Kwon

Tara T Ghaziani

Paul Kwo

Gary Tse

Andrew Kesselman

Aya Kamaya

Justin R Tse

Abstract

Purpose: To assess the accuracy, reliability, and readability of publicly available large language models in answering fundamental questions on hepatocellular carcinoma diagnosis and management.

Methods: Twenty questions on liver cancer diagnosis and management were each posed three times to ChatGPT-3.5 (OpenAI), Gemini (Google), and Bing (Microsoft). Responses were assessed by six fellowship-trained physicians from three academic liver transplant centers who actively diagnose and/or treat liver cancer. Each response was categorized as accurate (score 1; all information is true and relevant), inadequate (score 0; all information is true, but the response does not fully answer the question or provides irrelevant information), or inaccurate (score -1; any information is false). Means with standard deviations were recorded. The responses to a question were considered accurate as a whole if the mean score was > 0, and reliable if the mean score was > 0 for every response to that question. Responses were also quantified for readability using the Flesch Reading Ease Score and Flesch-Kincaid Grade Level. Readability and accuracy across 60 responses were compared using one-way ANOVAs with Tukey's multiple comparison tests.
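The scoring rule in the Methods can be sketched in a few lines of Python. This is one reading of the abstract, not the authors' code: it assumes that "accurate" means the mean over all reviewer scores for a question is above 0, and that "reliable" means each of the three responses has a mean above 0 on its own. The function name `summarize_question` and the ratings below are hypothetical, for illustration only.

```python
from statistics import mean


def summarize_question(ratings):
    """Summarize one question.

    ratings: one list per response (three responses per question),
    each holding the reviewer scores for that response, where
    1 = accurate, 0 = inadequate, -1 = inaccurate.
    """
    # Mean reviewer score for each individual response.
    per_response = [mean(scores) for scores in ratings]
    # Mean over every reviewer score given to this question.
    overall = mean(score for scores in ratings for score in scores)
    return {
        "per_response_means": per_response,
        # Assumed reading: accurate as a whole if the overall mean is > 0.
        "accurate": overall > 0,
        # Assumed reading: reliable only if every response has a mean > 0.
        "reliable": all(m > 0 for m in per_response),
    }


# Hypothetical ratings: three responses, six reviewers each.
example = [
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 1],
    [-1, -1, 0, 0, 1, -1],
]
result = summarize_question(example)
print(result["accurate"], result["reliable"])
```

In this made-up case the overall mean is positive, so the question counts as accurate, but the third response has a negative mean, so it does not count as reliable. That gap is how a chatbot can answer more questions accurately than it answers both accurately and reliably.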

Results: Of the twenty questions, ChatGPT answered nine (45%), Gemini answered twelve (60%), and Bing answered six (30%) accurately; however, only six (30%), eight (40%), and three (15%), respectively, were answered both accurately and reliably. There were no significant differences in accuracy among the chatbots. ChatGPT responses were the least readable (mean Flesch Reading Ease Score 29; college graduate), followed by Gemini (30; college) and Bing (40; college; p < 0.001).
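For context on the readability figures, the two Flesch measures are computed from word, sentence, and syllable counts using the standard published formulas. The sketch below takes those counts as inputs rather than extracting them from text, since syllable counting is tool-dependent and the abstract does not say which tool was used. The function names and the exact placement of band boundaries (for example, treating a score of exactly 30 as "college") are assumptions.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def reading_level(fre):
    """Map a Reading Ease score to the conventional Flesch band label.

    Boundary handling (>= lower bound) is an assumption of this sketch.
    """
    bands = [
        (90, "5th grade"),
        (80, "6th grade"),
        (70, "7th grade"),
        (60, "8th-9th grade"),
        (50, "10th-12th grade"),
        (30, "college"),
    ]
    for lower_bound, label in bands:
        if fre >= lower_bound:
            return label
    return "college graduate"


# Hypothetical passage: 100 words, 5 sentences, 150 syllables.
fre = flesch_reading_ease(100, 5, 150)
fkgl = flesch_kincaid_grade(100, 5, 150)
print(round(fre, 3), round(fkgl, 2), reading_level(fre))

# Band labels for the mean scores reported in the Results.
for score in (29, 30, 40):
    print(score, reading_level(score))
```

Under these bands, a mean score of 29 falls in "college graduate", while 30 and 40 fall in "college", which matches the labels reported for ChatGPT, Gemini, and Bing.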

Conclusion: Large language models provide complex responses to basic questions on hepatocellular carcinoma diagnosis and management that are seldom accurate, reliable, or readable.

Citing Articles

Revolutionizing MASLD: How Artificial Intelligence Is Shaping the Future of Liver Care.

Pugliese N, Bertazzoni A, Hassan C, Schattenberg J, Aghemo A Cancers (Basel). 2025; 17(5).

PMID: 40075570 PMC: 11899536. DOI: 10.3390/cancers17050722.

Application of large language models in disease diagnosis and treatment.

Yang X, Li T, Su Q, Liu Y, Kang C, Lyu Y Chin Med J (Engl). 2024; 138(2):130-142.

PMID: 39722188 PMC: 11745858. DOI: 10.1097/CM9.0000000000003456.
