PMID: 37625267

Benchmarking Large Language Models' Performances for Myopia Care: a Comparative Analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard

Abstract

Background: Large language models (LLMs) are garnering wide interest due to their human-like and contextually relevant responses. However, LLMs' accuracy in specific medical domains has not yet been thoroughly evaluated. Myopia is a frequent topic for which patients and parents commonly seek information online. Our study evaluated the performance of three LLMs, namely ChatGPT-3.5, ChatGPT-4.0, and Google Bard, in delivering accurate responses to common myopia-related queries.

Methods: We curated thirty-one commonly asked myopia care-related questions, which were categorised into six domains: pathogenesis, risk factors, clinical presentation, diagnosis, treatment and prevention, and prognosis. Each question was posed to the LLMs, and their responses were independently graded by three consultant-level paediatric ophthalmologists on a three-point accuracy scale (poor, borderline, good). A majority consensus approach was used to determine the final rating for each response. Responses rated 'good' were further evaluated for comprehensiveness on a five-point scale, whereas responses rated 'poor' were prompted for self-correction and then re-evaluated for accuracy.
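As a purely illustrative aside (not the authors' code), the minimal Python sketch below shows one way the majority-consensus grading step could be implemented for three graders; the 'needs adjudication' handling of three-way ties is a hypothetical assumption, since the abstract does not state how ties were resolved.

```python
# Minimal sketch of a majority-consensus grading step for three graders.
# Assumption (not specified in the abstract): three-way ties are flagged
# for adjudication rather than resolved automatically.
from collections import Counter

def consensus_rating(grades: list[str]) -> str:
    """Return the rating assigned by at least two of the three graders."""
    label, count = Counter(grades).most_common(1)[0]
    if count >= 2:
        return label
    return "needs adjudication"  # hypothetical tie handling

# Example: two of three consultant graders rate the response 'good'.
print(consensus_rating(["good", "good", "borderline"]))  # -> good
print(consensus_rating(["poor", "borderline", "good"]))  # -> needs adjudication
```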

Findings: ChatGPT-4.0 demonstrated superior accuracy, with 80.6% of responses rated 'good', compared with 61.3% for ChatGPT-3.5 and 54.8% for Google Bard (Pearson's chi-squared test, all p ≤ 0.009). All three LLM-Chatbots showed high mean comprehensiveness scores (Google Bard: 4.35; ChatGPT-4.0: 4.23; ChatGPT-3.5: 4.11, out of a maximum of 5). All three LLM-Chatbots also demonstrated substantial self-correction capabilities: 66.7% (2 of 3) of ChatGPT-4.0's, 40% (2 of 5) of ChatGPT-3.5's, and 60% (3 of 5) of Google Bard's responses improved after self-correction. The LLM-Chatbots performed consistently across domains, except for 'treatment and prevention'. Even in this domain, however, ChatGPT-4.0 outperformed the others, receiving 70% 'good' ratings, compared with 40% for ChatGPT-3.5 and 45% for Google Bard (Pearson's chi-squared test, all p ≤ 0.001).

Interpretation: Our findings underscore the potential of LLMs, particularly ChatGPT-4.0, for delivering accurate and comprehensive responses to myopia-related queries. Continuous strategies and evaluations to improve LLMs' accuracy remain crucial.

Funding: Dr Yih-Chung Tham was supported by the National Medical Research Council of Singapore (NMRC/MOH/HCSAINV21nov-0001).
