
Reliability in Evaluator-based Tests: Using Simulation-constructed Models to Determine Contextually Relevant Agreement Thresholds

Overview
Publisher BioMed Central
Date 2018 Nov 21
PMID 30453897
Citations 5
Abstract

Background: Indices of inter-evaluator reliability are used in many fields, such as computational linguistics, psychology, and medical science; however, the interpretation of the resulting values and the determination of appropriate thresholds lack context and are often guided only by arbitrary "rules of thumb" or not addressed at all. Our goal in this work was to develop a method for determining the relationship between inter-evaluator agreement and error, to facilitate meaningful interpretation of values, thresholds, and reliability.

Methods: Three expert human evaluators completed a video analysis task, and their results were averaged to create a reference dataset of 300 time measurements. We superimposed unique combinations of systematic and random error onto the reference dataset to generate 4900 hypothetical evaluators (each with 300 time measurements). The systematic and random errors made by each hypothetical evaluator were modeled as the mean and variance, respectively, of a normally distributed error signal. Calculating the error (as percent error) and the inter-evaluator agreement (as Krippendorff's alpha) between each hypothetical evaluator and the reference dataset allowed us to establish a mathematical model and an envelope of the worst possible percent error for any given level of agreement.
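
To make the simulation step concrete, the following Python sketch generates hypothetical evaluators by adding normally distributed error to a reference dataset and scores each against the reference. The value ranges, the 70 × 70 grid of error levels (one plausible way to reach 4900 combinations), and the use of the third-party krippendorff package are illustrative assumptions, not details taken from the paper.

```python
# Sketch of the simulation described above; all numeric ranges are assumptions.
import numpy as np
import krippendorff  # pip install krippendorff

rng = np.random.default_rng(0)

# Stand-in reference dataset: 300 time measurements (arbitrary units).
reference = rng.uniform(1.0, 10.0, size=300)

# 70 systematic-error levels x 70 random-error levels = 4900 evaluators.
systematic_levels = np.linspace(0.0, 1.0, 70)  # mean of the error signal
random_levels = np.linspace(0.0, 1.0, 70)      # std dev of the error signal

results = []
for mu in systematic_levels:
    for sigma in random_levels:
        # Hypothetical evaluator = reference + normally distributed error.
        evaluator = reference + rng.normal(loc=mu, scale=sigma,
                                           size=reference.shape)

        # Mean absolute percent error against the reference.
        pct_error = np.mean(np.abs(evaluator - reference) / reference) * 100

        # Krippendorff's alpha for interval data; rows are the two "coders".
        alpha = krippendorff.alpha(
            reliability_data=np.vstack([reference, evaluator]),
            level_of_measurement="interval",
        )
        results.append((mu, sigma, pct_error, alpha))
```

Each (percent error, alpha) pair then contributes one point to the agreement-versus-error relationship from which the envelope is built.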

Results: We used the relationship between inter-evaluator agreement and error to make an informed judgment about an acceptable threshold for Krippendorff's alpha within the context of our specific test. To demonstrate the utility of the modeling approach, we calculated the percent error and Krippendorff's alpha between the reference dataset and a new cohort of trained human evaluators, and used our contextually derived Krippendorff's alpha threshold as a gauge of evaluator quality. Although all evaluators had relatively high agreement (> 0.9) compared with the rule-of-thumb cutoff (0.8), our threshold accepted the evaluators with low error while rejecting one evaluator with relatively high error.
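
One way to realize the "worst possible percent error for any given amount of agreement" is to bin the simulated (alpha, percent error) pairs, take the maximum error in each bin, and read off the smallest alpha whose worst-case error stays within an acceptable tolerance. The sketch below continues from the simulation above (it reuses the results list) and assumes a hypothetical 5% error tolerance; the paper's actual envelope model and chosen tolerance may differ.

```python
# Continuing the sketch: derive a worst-case error envelope over alpha
# and read off a contextual threshold. The 5% tolerance is an assumption.
import numpy as np

results_arr = np.array(results)  # columns: mu, sigma, pct_error, alpha
errors, alphas = results_arr[:, 2], results_arr[:, 3]

bins = np.linspace(alphas.min(), 1.0, 50)
bin_idx = np.digitize(alphas, bins)

# Envelope: worst (maximum) percent error observed in each alpha bin.
envelope = np.array([
    errors[bin_idx == i].max() if np.any(bin_idx == i) else np.nan
    for i in range(1, len(bins) + 1)
])

# Contextual threshold: smallest alpha whose worst-case error is acceptable.
tolerance = 5.0  # percent; would be set by the test's accuracy requirements
ok = np.where(np.nan_to_num(envelope, nan=np.inf) <= tolerance)[0]
if ok.size:
    alpha_threshold = bins[ok[0]]
    print(f"Contextual Krippendorff's alpha threshold: {alpha_threshold:.3f}")
```

A new evaluator's alpha against the reference can then be compared with this threshold, which bounds their worst-case percent error rather than relying on a generic cutoff.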

Conclusions: We found that our approach established reliability threshold values, within the context of our evaluation criteria, that were far less permissive than the typically accepted "rule of thumb" cutoff for Krippendorff's alpha. This procedure provides a less arbitrary method for determining a reliability threshold and can be tailored to the context of any reliability index.

Citing Articles

Body Positivity, Physical Health, and Emotional Well-Being Discourse on Social Media: Content Analysis of Lizzo's Instagram.

Albert S, Massar R, Cassidy O, Fennelly K, Jay M, Massey P. JMIR Form Res. 2024; 8:e60541.

PMID: 39496156. PMC: 11574494. DOI: 10.2196/60541.


Hypothesis testing for detecting outlier evaluators.

Xu L, Zucker D, Wang M. Int J Biostat. 2024; 20(2):419-431.

PMID: 39485244. PMC: 11661559. DOI: 10.1515/ijb-2023-0004.


RuSentiTweet: a sentiment analysis dataset of general domain tweets in Russian.

Smetanin S. PeerJ Comput Sci. 2022; 8:e1039.

PMID: 36092008. PMC: 9454938. DOI: 10.7717/peerj-cs.1039.


Bedside colorimetric reagent dipstick in the diagnosis of meningitis in low- and middle-income countries: A prospective, international blinded comparison with laboratory analysis.

Wendler C, Mashimango L, Remi T, Larochelle P, Kang E, Brotherton B. Afr J Emerg Med. 2022; 12(3):161-164.

PMID: 35599842. PMC: 9118351. DOI: 10.1016/j.afjem.2022.04.004.


Comparison of original and modified Q risk 2 risk score with Framingham risk score - An Indian perspective.

Aggarwal P, Sinha S, Khanra D, Nath R, Gujral J, Reddy K. Indian Heart J. 2021; 73(3):353-358.

PMID: 34154755. PMC: 8322747. DOI: 10.1016/j.ihj.2021.01.016.
