Large Language Models' Responses to Liver Cancer Surveillance, Diagnosis, and Management Questions: Accuracy, Reliability, Readability
Overview
Radiology
Purpose: To assess the accuracy, reliability, and readability of publicly available large language models in answering fundamental questions on hepatocellular carcinoma diagnosis and management.
Methods: Twenty questions on liver cancer diagnosis and management were each asked in triplicate to ChatGPT-3.5 (OpenAI), Gemini (Google), and Bing (Microsoft). Responses were assessed by six fellowship-trained physicians from three academic liver transplant centers who actively diagnose and/or treat liver cancer. Each response was categorized as accurate (score 1; all information is true and relevant), inadequate (score 0; all information is true, but the response does not fully answer the question or provides irrelevant information), or inaccurate (score -1; any information is false). Means with standard deviations were recorded. A chatbot's answer to a question was considered accurate overall if the mean score was > 0, and reliable if the mean score was > 0 across all responses to that question. Responses were also quantified for readability using the Flesch Reading Ease Score and Flesch-Kincaid Grade Level. Readability and accuracy across 60 responses were compared using one-way ANOVAs with Tukey's multiple comparison tests.
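The accuracy and reliability rule described in the Methods can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function name `classify_question` and the input layout are hypothetical, and the sketch assumes one plausible reading of the rule, namely that "accurate" means the mean over all ratings for a question exceeds 0 and "reliable" means the mean exceeds 0 for each of the three repeated responses.

```python
from statistics import mean


def classify_question(ratings):
    """Classify one chatbot's answers to one question.

    ratings: list of three lists (one per repeated response), each holding
    the rater scores for that response, with values in {-1, 0, 1}.
    Returns (accurate, reliable) under the assumed reading of the rule.
    """
    # Mean score of each individual response across raters.
    per_response = [mean(r) for r in ratings]
    # Mean score pooled over every rating of every response.
    overall = mean([score for r in ratings for score in r])
    accurate = overall > 0
    # Assumption: reliable requires every repeated response to score > 0.
    reliable = all(m > 0 for m in per_response)
    return accurate, reliable


# Two well-rated responses and one poorly rated response (six raters each).
example = [
    [1, 1, 1, 1, 0, 1],
    [1, 0, 1, 1, 1, 1],
    [-1, -1, 0, 0, 1, -1],
]
print(classify_question(example))
```

Under this reading, a question can be accurate on average yet unreliable when one of the three repeated responses is rated poorly, which is consistent with fewer questions being both accurate and reliable than accurate alone.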
Results: Of the twenty questions, ChatGPT answered nine (45%), Gemini answered 12 (60%), and Bing answered six (30%) questions accurately; however, only six (30%), eight (40%), and three (15%), respectively, were both accurate and reliable. There were no significant differences in accuracy among the chatbots. ChatGPT responses were the least readable (mean Flesch Reading Ease Score 29; college graduate), followed by Gemini (30; college) and Bing (40; college; p < 0.001).
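For context on the readability figures, the two metrics are conventionally computed from word, sentence, and syllable counts using the standard Flesch formulas. The sketch below assumes the counts are already available (syllable counting is tool-dependent and is not specified in the abstract); the function names are illustrative only.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Standard Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Standard Flesch-Kincaid Grade Level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical passage: 100 words, 5 sentences, 150 syllables.
print(round(flesch_reading_ease(100, 5, 150), 3))
print(round(flesch_kincaid_grade(100, 5, 150), 2))
```

On the Reading Ease scale, scores of roughly 30 to 50 are conventionally labeled college level and scores below 30 college graduate level, which matches the bands reported for the three chatbots.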
Conclusion: Large language models provide complex responses to basic questions on hepatocellular carcinoma diagnosis and management that are seldom accurate, reliable, or readable.