PMID: 37606922

Comparison of Ophthalmologist and Large Language Model Chatbot Responses to Online Patient Eye Care Questions

Abstract

Importance: Large language models (LLMs) like ChatGPT appear capable of performing a variety of tasks, including answering patient eye care questions, but have not yet been evaluated in direct comparison with ophthalmologists. It remains unclear whether LLM-generated advice is accurate, appropriate, and safe for eye patients.

Objective: To evaluate the quality of ophthalmology advice generated by an LLM chatbot in comparison with ophthalmologist-written advice.

Design, Setting, And Participants: This cross-sectional study used deidentified data from an online medical forum, in which patient questions received responses written by American Academy of Ophthalmology (AAO)-affiliated ophthalmologists. A masked panel of 8 board-certified ophthalmologists was asked to distinguish between answers generated by the ChatGPT chatbot and answers written by ophthalmologists. Posts were dated between 2007 and 2016; data were accessed in January 2023, and analysis was performed between March and May 2023.

Main Outcomes And Measures: Identification of chatbot and human answers on a 4-point scale (likely or definitely artificial intelligence [AI] vs likely or definitely human) and evaluation of responses for presence of incorrect information, alignment with perceived consensus in the medical community, likelihood to cause harm, and extent of harm.

Results: A total of 200 pairs of user questions and answers by AAO-affiliated ophthalmologists were evaluated. The mean (SD) accuracy for distinguishing between AI and human responses was 61.3% (9.7%). Of 800 evaluations of chatbot-written answers, 168 answers (21.0%) were marked as human-written, while 517 of 800 human-written answers (64.6%) were marked as AI-written. Compared with human answers, chatbot answers were more frequently rated as likely or definitely written by AI (prevalence ratio [PR], 1.72; 95% CI, 1.52-1.93). The likelihood of chatbot answers containing incorrect or inappropriate material was comparable with that of human answers (PR, 0.92; 95% CI, 0.77-1.10), and chatbot answers did not differ from human answers in likelihood of harm (PR, 0.84; 95% CI, 0.67-1.07) or extent of harm (PR, 0.99; 95% CI, 0.80-1.22).
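
For context, a prevalence ratio (PR) compares the proportion of answers receiving a given rating in the chatbot group with the corresponding proportion in the human group. A minimal, unadjusted form of the estimate and its confidence interval is sketched below; the study's published estimates may additionally account for repeated ratings by the same graders, so this is illustrative only.

\[
\mathrm{PR} = \frac{x_{\text{chatbot}}/n_{\text{chatbot}}}{x_{\text{human}}/n_{\text{human}}},
\qquad
95\%\ \mathrm{CI} = \exp\!\left( \ln \mathrm{PR} \pm 1.96 \sqrt{ \frac{1-p_{\text{chatbot}}}{x_{\text{chatbot}}} + \frac{1-p_{\text{human}}}{x_{\text{human}}} } \right),
\]

where x is the number of answers with the rating of interest, n is the number of evaluations in that group, and p = x/n.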

Conclusions And Relevance: In this cross-sectional study of human-written and AI-generated responses to 200 eye care questions from an online advice forum, a chatbot appeared capable of responding to long user-written eye health posts and largely generated appropriate responses that did not differ significantly from ophthalmologist-written responses in terms of incorrect information, likelihood of harm, extent of harm, or deviation from ophthalmologist community standards. Additional research is needed to assess patient attitudes toward LLM-augmented ophthalmologists vs fully autonomous AI content generation, to evaluate the clarity and acceptability of LLM-generated answers from the patient perspective, to test the performance of LLMs in a greater variety of clinical contexts, and to determine how LLMs can be used ethically and with minimal harm.

Citing Articles

Evaluating Artificial Intelligence in Spinal Cord Injury Management: A Comparative Analysis of ChatGPT-4o and Google Gemini Against American College of Surgeons Best Practices Guidelines for Spine Injury.

Yu A, Li A, Ahmed W, Saturno M, Cho S. Global Spine J. 2025;:21925682251321837.

PMID: 39959933 PMC: 11833805. DOI: 10.1177/21925682251321837.


Large Language Models for Chatbot Health Advice Studies: A Systematic Review.

Huo B, Boyle A, Marfo N, Tangamornsuksan W, Steen J, McKechnie T. JAMA Netw Open. 2025;8(2):e2457879.

PMID: 39903463 PMC: 11795331. DOI: 10.1001/jamanetworkopen.2024.57879.


Current applications and challenges in large language models for patient care: a systematic review.

Busch F, Hoffmann L, Rueger C, van Dijk E, Kader R, Ortiz-Prado E. Commun Med (Lond). 2025;5(1):26.

PMID: 39838160 PMC: 11751060. DOI: 10.1038/s43856-024-00717-2.


Assessing the possibility of using large language models in ocular surface diseases.

Ling Q, Xu Z, Zeng Y, Hong Q, Qian X, Hu J. Int J Ophthalmol. 2025;18(1):1-8.

PMID: 39829624 PMC: 11672086. DOI: 10.18240/ijo.2025.01.01.


Large language models for accurate disease detection in electronic health records: the examples of crystal arthropathies.

Burgisser N, Chalot E, Mehouachi S, Buclin C, Lauper K, Courvoisier D. RMD Open. 2025;10(4).

PMID: 39794274 PMC: 11664341. DOI: 10.1136/rmdopen-2024-005003.

