Building an Evaluation Scale Using Item Response Theory

Overview

Journal: Proc Conf Empir Methods Nat Lang Process

Date: 2016 Dec 23

PMID: 28004039

Citations: 7

Authors

John P Lalor

Hao Wu

Hong Yu

Abstract

Evaluation of NLP methods requires testing against a previously vetted gold-standard test set and reporting standard metrics (accuracy/precision/recall/F1). The current assumption is that all items in a given test set are equal with regard to difficulty and discriminating power. We propose Item Response Theory (IRT) from psychometrics as an alternative means for gold-standard test-set generation and NLP system evaluation. IRT is able to describe characteristics of individual items - their difficulty and discriminating power - and can account for these characteristics in its estimation of human intelligence or ability for an NLP task. In this paper, we demonstrate IRT by generating a gold-standard test set for Recognizing Textual Entailment. By collecting a large number of human responses and fitting our IRT model, we show that the model compares NLP systems against the performance of a human population and provides more insight into system performance than standard evaluation metrics. We also show that a high accuracy score does not always imply a high IRT score, because the IRT score depends on the item characteristics and the response pattern.
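The abstract's last point, that equal accuracy need not mean equal IRT score, can be illustrated with a small sketch. The abstract does not state which IRT model the authors fit, so the code below assumes a two-parameter logistic (2PL) model purely for illustration. The item parameters, the two "systems", and the grid-search estimator are all hypothetical and are not taken from the paper.

```python
import math

def p_correct(theta, a, b):
    """2PL item response function (assumed model): P(correct | ability theta).

    a is the item's discrimination, b is its difficulty.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, items, responses):
    """Log-likelihood of a 0/1 response pattern at ability theta."""
    total = 0.0
    for (a, b), y in zip(items, responses):
        p = p_correct(theta, a, b)
        total += math.log(p) if y else math.log(1.0 - p)
    return total

def estimate_ability(items, responses, lo=-4.0, hi=4.0, steps=801):
    """Maximum-likelihood ability estimate via a simple grid search.

    A grid search is a simplification chosen for readability; real IRT
    software uses more careful estimators.
    """
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))

def accuracy(responses):
    return sum(responses) / len(responses)

# Hypothetical items as (discrimination a, difficulty b) pairs.
ITEMS = [(2.0, -1.0), (2.0, 0.0), (0.5, 1.0), (0.5, 2.0)]

# Two hypothetical systems, each answering 2 of 4 items correctly.
system_a = [1, 1, 0, 0]  # correct on the highly discriminating items
system_b = [0, 0, 1, 1]  # correct only on the weakly discriminating items

theta_a = estimate_ability(ITEMS, system_a)
theta_b = estimate_ability(ITEMS, system_b)

print(f"accuracy A = {accuracy(system_a):.2f}, ability A = {theta_a:.2f}")
print(f"accuracy B = {accuracy(system_b):.2f}, ability B = {theta_b:.2f}")
```

Under these assumed parameters both systems have the same accuracy, yet system A receives the higher ability estimate, because a 2PL ability estimate weights each response by the item's discrimination rather than counting all items equally.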

Citing Articles

Assessing key soft skills in organizational contexts: development and validation of the multiple soft skills assessment tool.

Colledani D, Robusto E, Anselmi P. Front Psychol. 2024; 15:1405822.

PMID: 39554711 PMC: 11564814. DOI: 10.3389/fpsyg.2024.1405822.

IRTCI: Item Response Theory for Categorical Imputation.

Kline A, Luo Y. Res Sq. 2024.

PMID: 39011102 PMC: 11247932. DOI: 10.21203/rs.3.rs-4529519/v1.

Knowledge, attitude, and practices to zoonotic disease risks from livestock birth products among smallholder communities in Ethiopia.

Alemayehu G, Mamo G, Desta H, Alemu B, Wieland B. One Health. 2021; 12:100223.

PMID: 33614884 PMC: 7879039. DOI: 10.1016/j.onehlt.2021.100223.

Understanding Deep Learning Performance through an Examination of Test Set Difficulty: A Psychometric Case Study.

Lalor J, Wu H, Munkhdalai T, Yu H. Proc Conf Empir Methods Nat Lang Process. 2020; 2018:4711-4716.

PMID: 33241233 PMC: 7685075. DOI: 10.18653/v1/d18-1500.

Using Item Response Theory for Explainable Machine Learning in Predicting Mortality in the Intensive Care Unit: Case-Based Approach.

Kline A, Kline T, Shakeri Hossein Abad Z, Lee J. J Med Internet Res. 2020; 22(9):e20268.

PMID: 32975523 PMC: 7547395. DOI: 10.2196/20268.
