
Large Language Models' Responses to Liver Cancer Surveillance, Diagnosis, and Management Questions: Accuracy, Reliability, Readability

Overview
Publisher: Springer
Date: 2024 Aug 1
PMID: 39088019
Abstract

Purpose: To assess the accuracy, reliability, and readability of publicly available large language models in answering fundamental questions on hepatocellular carcinoma diagnosis and management.

Methods: Twenty questions on liver cancer diagnosis and management were each asked in triplicate to ChatGPT-3.5 (OpenAI), Gemini (Google), and Bing (Microsoft). Responses were assessed by six fellowship-trained physicians from three academic liver transplant centers who actively diagnose and/or treat liver cancer. Each response was scored as accurate (score 1; all information is true and relevant), inadequate (score 0; all information is true but does not fully answer the question or includes irrelevant information), or inaccurate (score -1; any information is false). Means with standard deviations were recorded. A response was considered accurate overall if its mean score was > 0, and a chatbot's answer to a question was considered reliable if the mean score was > 0 for all three responses to that question. Readability was quantified using the Flesch Reading Ease Score and Flesch-Kincaid Grade Level. Readability and accuracy across 60 responses were compared using one-way ANOVAs with Tukey's multiple comparison tests.
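For context, the readability metrics above follow the standard published Flesch formulas. The sketch below is a minimal Python illustration, not the authors' analysis code: the syllable counter is a naive vowel-group heuristic, and classify_question is a hypothetical helper that mirrors the accuracy and reliability criteria as described in this abstract.

import re
from statistics import mean

def count_syllables(word: str) -> int:
    # Naive heuristic: one syllable per run of consecutive vowels (approximation).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_scores(text: str) -> tuple[float, float]:
    # Standard Flesch formulas, built from words-per-sentence and syllables-per-word.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0, 0.0
    wps = len(words) / sentences                                # words per sentence
    spw = sum(count_syllables(w) for w in words) / len(words)   # syllables per word
    reading_ease = 206.835 - 1.015 * wps - 84.6 * spw           # Flesch Reading Ease
    grade_level = 0.39 * wps + 11.8 * spw - 15.59               # Flesch-Kincaid Grade Level
    return reading_ease, grade_level

def classify_question(scores: list[list[int]]) -> tuple[bool, bool]:
    # Hypothetical helper (assumption): scores holds one list of physician
    # ratings (-1, 0, or 1) for each of the three responses to one question.
    response_means = [mean(r) for r in scores]
    accurate = mean(response_means) > 0            # accurate overall (mean > 0)
    reliable = all(m > 0 for m in response_means)  # accurate across all three responses
    return accurate, reliable

On this scale, a Reading Ease near 30, as reported below, corresponds to dense, college-graduate-level prose, while plain language typically scores above 60.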

Results: Of the 20 questions, ChatGPT answered 9 (45%), Gemini 12 (60%), and Bing 6 (30%) accurately; however, only 6 (30%), 8 (40%), and 3 (15%), respectively, were both accurate and reliable. There were no significant differences in accuracy among the chatbots. ChatGPT responses were the least readable (mean Flesch Reading Ease Score 29; college-graduate level), followed by Gemini (30; college) and Bing (40; college; p < 0.001).

Conclusion: Large language models provide complex responses to basic questions on hepatocellular carcinoma diagnosis and management that are seldom accurate, reliable, or readable.

Citing Articles

Revolutionizing MASLD: How Artificial Intelligence Is Shaping the Future of Liver Care.

Pugliese N, Bertazzoni A, Hassan C, Schattenberg J, Aghemo A. Cancers (Basel). 2025; 17(5). PMID: 40075570. PMC: 11899536. DOI: 10.3390/cancers17050722.


Application of large language models in disease diagnosis and treatment.

Yang X, Li T, Su Q, Liu Y, Kang C, Lyu Y. Chin Med J (Engl). 2024; 138(2):130-142. PMID: 39722188. PMC: 11745858. DOI: 10.1097/CM9.0000000000003456.
