ChatGPT-4o Can Serve As the Second Rater for Data Extraction in Systematic Reviews

Overview
Journal PLoS One
Date 2025 Jan 8
PMID 39774443
Abstract

Background: Systematic reviews bring clarity to a large body of evidence and support the transfer of knowledge from clinical trials to guidelines. Yet they are time-consuming. Artificial intelligence (AI) tools such as ChatGPT-4o may streamline data extraction, but their efficacy requires validation.

Objective: This study aims to (1) evaluate the validity of ChatGPT-4o for data extraction compared to human reviewers, and (2) test the reproducibility of ChatGPT-4o's data extraction.

Methods: We conducted a comparative study using papers from an ongoing systematic review on exercise to reduce fall risk. Data extracted by ChatGPT-4o were compared to a reference standard: data extracted by two independent human reviewers. The validity was assessed by categorizing the extracted data into five categories ranging from completely correct to false data. Reproducibility was evaluated by comparing data extracted in two separate sessions using different ChatGPT-4o accounts.
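
As a rough illustration of the reproducibility check, overall agreement between two extraction sessions can be computed as the share of data points whose extracted values are identical. The sketch below is an assumption, not the authors' procedure: the function name, the dictionary layout, and the sample values are all hypothetical, and the abstract does not say how values were matched.

```python
def percent_agreement(session_a, session_b):
    """Return the share of shared data points with identical values in both sessions."""
    shared_keys = session_a.keys() & session_b.keys()
    if not shared_keys:
        raise ValueError("no shared data points to compare")
    matches = sum(1 for key in shared_keys if session_a[key] == session_b[key])
    return matches / len(shared_keys)


# Hypothetical extractions of the same paper from two separate sessions.
first_session = {
    "sample_size": "120",
    "mean_age": "72.4",
    "intervention": "balance training",
    "follow_up": "not reported",
}
second_session = {
    "sample_size": "120",
    "mean_age": "72.4",
    "intervention": "balance training",
    "follow_up": "12 months",
}

agreement = percent_agreement(first_session, second_session)
print(f"Agreement: {agreement:.1%}")
```

In this made-up example the two sessions disagree only on the field that the paper does not report, which is the kind of item where the study found lower agreement.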

Results: ChatGPT-4o extracted a total of 484 data points across 11 papers. The AI's data extraction was 92.4% accurate (95% CI: 89.5% to 94.5%) and produced false data in 5.2% of cases (95% CI: 3.4% to 7.4%). The reproducibility between the two sessions was high, with an overall agreement of 94.1%. Reproducibility decreased when information was not reported in the papers, with an agreement of 77.2%.
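
To make the reported intervals easier to interpret, the sketch below computes a 95% confidence interval for a proportion with the Wilson score method. Two things here are assumptions rather than facts from the abstract: the interval method (the abstract does not name one) and the raw counts, which are back-calculated from the reported percentages and the 484 data points. Under these assumptions the results should land close to the reported bounds, though not necessarily match them exactly.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


TOTAL_POINTS = 484                             # reported in the abstract
correct_count = round(0.924 * TOTAL_POINTS)    # assumed count, back-calculated
false_count = round(0.052 * TOTAL_POINTS)      # assumed count, back-calculated

for label, count in [("accurate", correct_count), ("false data", false_count)]:
    low, high = wilson_interval(count, TOTAL_POINTS)
    print(f"{label}: {count / TOTAL_POINTS:.1%} (95% CI: {low:.1%} to {high:.1%})")
```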

Conclusion: The validity and reproducibility of ChatGPT-4o were high for data extraction in systematic reviews. ChatGPT-4o qualified as a second reviewer for systematic reviews and showed potential for future advancements in summarizing data.
