Evaluating the Quality and Readability of Information Provided by Generative Artificial Intelligence Chatbots on Clavicle Fracture Treatment Options
Introduction: Generative artificial intelligence (AI) chatbots, such as ChatGPT, have become more capable and more widely used, making their role in patient education increasingly salient. This study aimed to compare the educational utility of six AI chatbots by quantifying the readability and quality of their answers to common patient questions about clavicle fracture management.

Methods: ChatGPT 4, ChatGPT 4o, Gemini 1.0, Gemini 1.5 Pro, Microsoft Copilot, and Perplexity were used with no prior training. Ten representative patient questions about clavicle fractures were posed to each model. The readability of the AI responses was measured using the Flesch-Kincaid Reading Grade Level, Gunning Fog index, and Simple Measure of Gobbledygook (SMOG). Six orthopedists, blinded to the source model, graded the quality of each model's responses using the DISCERN criteria. Both metrics were analyzed with the Kruskal-Wallis test.

Results: No statistically significant difference in readability was found among the six models. Microsoft Copilot (70.33±7.74) and Perplexity (71.83±7.57) demonstrated significantly higher DISCERN scores than ChatGPT 4 (56.67±7.15) and Gemini 1.5 Pro (51.00±8.94), and Gemini 1.0 (68.00±6.42) likewise scored significantly higher than Gemini 1.5 Pro. The mean overall quality rating (DISCERN question 16) of each model was at or above average (range, 3-4.4).

Conclusion: The findings suggest that generative AI models are capable of serving as supplementary patient education materials. With comparable readability and high overall quality, Microsoft Copilot and Perplexity may offer the greatest educational utility regarding surgical intervention for clavicle fractures.
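For readers unfamiliar with the readability indices named in the Methods, all three are computed from sentence, word, and syllable counts. The following minimal Python sketch illustrates the standard published formulas; it assumes the counts have already been extracted from a chatbot response, and exact counting rules (e.g., syllable detection) vary across commercial calculators, so scores may differ slightly from the tools used in the study.

# Standard formulas for the three readability indices cited in the Methods.
# Inputs are raw counts taken from a single chatbot response.
import math

def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    # Flesch-Kincaid Reading Grade Level: higher values indicate harder text.
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def gunning_fog(words: int, sentences: int, complex_words: int) -> float:
    # Gunning Fog index: "complex words" have three or more syllables.
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))

def smog(sentences: int, polysyllables: int) -> float:
    # SMOG grade: polysyllable count scaled to a 30-sentence sample.
    return 1.0430 * math.sqrt(polysyllables * (30 / sentences)) + 3.1291

# Hypothetical example: a 250-word response with 15 sentences,
# 380 syllables, and 40 words of three or more syllables.
print(flesch_kincaid_grade(250, 15, 380))  # ~8.8
print(gunning_fog(250, 15, 40))            # ~13.1
print(smog(15, 40))                        # ~12.5

All three indices approximate the US school grade level needed to comprehend the text, which is why they are commonly reported together when assessing patient education materials.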