Comparison of Ophthalmologist and Large Language Model Chatbot Responses to Online Patient Eye Care Questions
Importance: Large language models (LLMs) like ChatGPT appear capable of performing a variety of tasks, including answering patient eye care questions, but have not yet been evaluated in direct comparison with ophthalmologists. It remains unclear whether LLM-generated advice is accurate, appropriate, and safe for eye patients.
Objective: To evaluate the quality of ophthalmology advice generated by an LLM chatbot in comparison with ophthalmologist-written advice.
Design, Setting, And Participants: This cross-sectional study used deidentified data from an online medical forum in which patient questions received responses written by American Academy of Ophthalmology (AAO)-affiliated ophthalmologists. A masked panel of 8 board-certified ophthalmologists was asked to distinguish between answers generated by the ChatGPT chatbot and human answers. Posts were dated between 2007 and 2016; data were accessed in January 2023, and analysis was performed between March and May 2023.
Main Outcomes And Measures: Identification of chatbot and human answers on a 4-point scale (likely or definitely artificial intelligence [AI] vs likely or definitely human) and evaluation of responses for presence of incorrect information, alignment with perceived consensus in the medical community, likelihood to cause harm, and extent of harm.
Results: A total of 200 pairs of user questions and answers by AAO-affiliated ophthalmologists were evaluated. The mean (SD) accuracy for distinguishing between AI and human responses was 61.3% (9.7%). Of 800 evaluations of chatbot-written answers, 168 answers (21.0%) were marked as human-written, while 517 of 800 human-written answers (64.6%) were marked as AI-written. Compared with human answers, chatbot answers were more frequently rated as likely or definitely written by AI (prevalence ratio [PR], 1.72; 95% CI, 1.52-1.93). The likelihood of chatbot answers containing incorrect or inappropriate material was comparable with that of human answers (PR, 0.92; 95% CI, 0.77-1.10), and chatbot answers did not differ from human answers in likelihood of harm (PR, 0.84; 95% CI, 0.67-1.07) or extent of harm (PR, 0.99; 95% CI, 0.80-1.22).
Conclusions And Relevance: In this cross-sectional study of human-written and AI-generated responses to 200 eye care questions from an online advice forum, a chatbot appeared capable of responding to long user-written eye health posts and largely generated appropriate responses that did not differ significantly from ophthalmologist-written responses in terms of incorrect information, likelihood of harm, extent of harm, or deviation from ophthalmologist community standards. Additional research is needed to assess patient attitudes toward LLM-augmented ophthalmologists vs fully autonomous AI content generation, to evaluate clarity and acceptability of LLM-generated answers from the patient perspective, to test the performance of LLMs in a greater variety of clinical contexts, and to determine an optimal manner of utilizing LLMs that is ethical and minimizes harm.
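The prevalence ratios reported in the Results can be illustrated with a short calculation. The sketch below is a minimal, unadjusted estimate: `prevalence_ratio` is a hypothetical helper, the counts are crude tallies derived from the abstract (632 = 800 - 168 chatbot answers marked AI-written; 517 of 800 human answers marked AI-written), and the Wald confidence interval on the log scale ignores the repeated-ratings structure that the study's own statistical models account for, so it will not reproduce the published PR of 1.72.

```python
import math


def prevalence_ratio(events_a, total_a, events_b, total_b, z=1.96):
    """Crude prevalence ratio of group A vs group B with a log-scale Wald CI.

    This is an unadjusted estimate for two independent proportions; it does
    not account for clustering by grader or question, as the study's models do.
    """
    p_a = events_a / total_a
    p_b = events_b / total_b
    pr = p_a / p_b
    # SE of ln(PR): sqrt(1/a - 1/n_a + 1/b - 1/n_b), written in proportion form
    se = math.sqrt((1 - p_a) / events_a + (1 - p_b) / events_b)
    lo = math.exp(math.log(pr) - z * se)
    hi = math.exp(math.log(pr) + z * se)
    return pr, lo, hi


# Illustrative counts only; the published PR (1.72) comes from the study's
# own modeling and will differ from this crude figure.
pr, lo, hi = prevalence_ratio(632, 800, 517, 800)
print(f"PR = {pr:.2f} (95% CI, {lo:.2f}-{hi:.2f})")  # → PR = 1.22 (95% CI, 1.15-1.30)
```

The log-scale interval is used because a ratio of proportions is approximately log-normal in moderate samples, which keeps the lower bound above zero.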