Building an Evaluation Scale Using Item Response Theory
Evaluation of NLP methods requires testing against a previously vetted gold-standard test set and reporting standard metrics (accuracy, precision, recall, F1). The current assumption is that all items in a given test set are equal with regard to difficulty and discriminating power. We propose Item Response Theory (IRT) from psychometrics as an alternative means for gold-standard test-set generation and NLP system evaluation. IRT describes the characteristics of individual items (their difficulty and discriminating power) and accounts for these characteristics when estimating human intelligence or ability on an NLP task. In this paper, we demonstrate IRT by generating a gold-standard test set for Recognizing Textual Entailment. By collecting a large number of human responses and fitting our IRT model, we show that the model compares NLP systems against the performance of a human population and provides more insight into system performance than standard evaluation metrics. We also show that a high accuracy score does not always imply a high IRT score, because the IRT score depends on the item characteristics and the response pattern.
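The last point can be illustrated with a minimal sketch. It assumes a two-parameter logistic (2PL) IRT model, which may differ from the model fitted in the paper, and it uses invented item parameters and response patterns rather than any data from the study. The names `p_correct`, `log_likelihood`, and `estimate_ability` are hypothetical helpers introduced for this illustration.

```python
import math

def p_correct(theta, a, b):
    """2PL item characteristic curve (assumed model):
    probability of a correct response given ability theta,
    item discrimination a, and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, items, responses):
    """Log-likelihood of a 0/1 response pattern at ability theta."""
    ll = 0.0
    for (a, b), y in zip(items, responses):
        p = p_correct(theta, a, b)
        ll += math.log(p) if y else math.log(1.0 - p)
    return ll

def estimate_ability(items, responses, lo=-4.0, hi=4.0, steps=801):
    """Maximum-likelihood ability estimate via a simple grid search.
    Assumes the pattern mixes correct and incorrect answers, so the
    maximum lies inside the grid."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))

# Invented (discrimination, difficulty) pairs, for illustration only.
items = [(0.5, -1.0), (0.5, 0.0), (2.0, 0.0), (2.0, 1.0)]

# Two hypothetical systems with the same accuracy (2 of 4 correct).
system_a = [1, 1, 0, 0]  # correct only on the weakly discriminating items
system_b = [0, 0, 1, 1]  # correct only on the highly discriminating items

acc_a = sum(system_a) / len(system_a)
acc_b = sum(system_b) / len(system_b)
theta_a = estimate_ability(items, system_a)
theta_b = estimate_ability(items, system_b)

print(f"A: accuracy={acc_a:.2f}, ability={theta_a:.2f}")
print(f"B: accuracy={acc_b:.2f}, ability={theta_b:.2f}")
```

Under these assumptions both systems score 50% accuracy, yet system B receives the higher ability estimate because its correct answers fall on the harder, more discriminating items. In practice the item parameters would be fitted from the collected human responses rather than set by hand.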