
Validity and Reliability of an Instrument Evaluating the Performance of Intelligent Chatbot: the Artificial Intelligence Performance Instrument (AIPI)

Overview
Date 2023 Sep 12
PMID 37698703
Abstract

Objectives: To evaluate the reliability and validity of the Artificial Intelligence Performance Instrument (AIPI).

Methods: Medical records of patients consulting an otolaryngology department were evaluated by physicians and by ChatGPT for differential diagnosis, management, and treatment. ChatGPT's performance was rated twice with the AIPI within a 7-day period to assess test-retest reliability. Internal consistency was evaluated using Cronbach's α. Internal validity was evaluated by comparing the AIPI scores of the clinical cases rated by ChatGPT and by 2 blinded practitioners. Convergent validity was measured by comparing the AIPI score with a modified version of the Ottawa Clinic Assessment Tool (OCAT). Interrater reliability was assessed using Kendall's tau.

Results: Forty-five patients (28 females) completed the evaluations. Cronbach's α indicated adequate internal consistency of the AIPI (α = 0.754). Test-retest reliability was moderate-to-strong for the AIPI items and total score (r = 0.486, p = 0.001). The mean AIPI score of the senior otolaryngologist was significantly higher than that of ChatGPT, supporting adequate internal validity (p = 0.001). Convergent validity analysis showed a moderate, significant correlation between the AIPI and the modified OCAT (r = 0.319; p = 0.044). Interrater reliability analysis showed significant positive concordance between both otolaryngologists for the patient feature, diagnosis, additional examination, and treatment subscores, as well as for the AIPI total score.

Conclusions: The AIPI is a valid and reliable instrument for assessing the performance of ChatGPT in ear, nose, and throat conditions. Future studies are needed to investigate the usefulness of the AIPI in medicine and surgery and to evaluate its psychometric properties in those fields.

Citing Articles

Advancing Clinical Chatbot Validation Using AI-Powered Evaluation With a New 3-Bot Evaluation System: Instrument Validation Study.

Choo S, Yoo S, Endo K, Truong B, Son M. JMIR Nurs. 2025; 8:e63058.

PMID: 40014000 PMC: 11884306. DOI: 10.2196/63058.


A radiopathomics model for predicting large-number cervical lymph node metastasis in clinical N0 papillary thyroid carcinoma.

Xiao W, Zhou W, Yuan H, Liu X, He F, Hu X. Eur Radiol. 2025.

PMID: 39881038 DOI: 10.1007/s00330-025-11377-8.


Artificial intelligence for image recognition in diagnosing oral and oropharyngeal cancer and leukoplakia.

Schmidl B, Hutten T, Pigorsch S, Stogbauer F, Hoch C, Hussain T. Sci Rep. 2025; 15(1):3625.

PMID: 39880876 PMC: 11779835. DOI: 10.1038/s41598-025-85920-4.


Enhancing Multilingual Patient Education: ChatGPT's Accuracy and Readability for SSNHL Queries in English and Spanish.

Ajit-Roger E, Moise A, Peralta C, Orishchak O, Daniel S. OTO Open. 2024; 8(4):e70048.

PMID: 39664064 PMC: 11633712. DOI: 10.1002/oto2.70048.


Harnessing the Power of ChatGPT in Cardiovascular Medicine: Innovations, Challenges, and Future Directions.

Leon M, Ruaengsri C, Pelletier G, Bethencourt D, Shibata M, Flores M. J Clin Med. 2024; 13(21).

PMID: 39518681 PMC: 11546989. DOI: 10.3390/jcm13216543.


References
1.
Pernencar C, Saboia I, Dias J. How Far Can Conversational Agents Contribute to IBD Patient Health Care-A Review of the Literature. Front Public Health. 2022; 10:862432. PMC: 9282671. DOI: 10.3389/fpubh.2022.862432.

2.
Wahlster W. Understanding computational dialogue understanding. Philos Trans A Math Phys Eng Sci. 2023; 381(2251):20220049. DOI: 10.1098/rsta.2022.0049.

3.
Hill-Yardin E, Hutchinson M, Laycock R, Spencer S. A Chat(GPT) about the future of scientific publishing. Brain Behav Immun. 2023; 110:152-154. DOI: 10.1016/j.bbi.2023.02.022.

4.
Mohammad B, Supti T, Alzubaidi M, Shah H, Alam T, Shah Z. The Pros and Cons of Using ChatGPT in Medical Education: A Scoping Review. Stud Health Technol Inform. 2023; 305:644-647. DOI: 10.3233/SHTI230580.

5.
Rekman J, Hamstra S, Dudek N, Wood T, Seabrook C, Gofton W. A New Instrument for Assessing Resident Competence in Surgical Clinic: The Ottawa Clinic Assessment Tool. J Surg Educ. 2016; 73(4):575-82. DOI: 10.1016/j.jsurg.2016.02.003.