
Detecting Hallucinations in Large Language Models Using Semantic Entropy

Overview
Journal: Nature
Specialty: Science
Date: 2024 Jun 19
PMID: 38898292
Abstract

Large language model (LLM) systems, such as ChatGPT or Gemini, can show impressive reasoning and question-answering capabilities but often 'hallucinate' false outputs and unsubstantiated answers. Answering unreliably or without the necessary information prevents adoption in diverse fields, with problems including fabrication of legal precedents or untrue facts in news articles and even posing a risk to human life in medical domains such as radiology. Encouraging truthfulness through supervision or reinforcement has been only partially successful. Researchers need a general method for detecting hallucinations in LLMs that works even with new and unseen questions to which humans might not know the answer. Here we develop new methods grounded in statistics, proposing entropy-based uncertainty estimators for LLMs to detect a subset of hallucinations, termed confabulations, which are arbitrary and incorrect generations. Our method addresses the fact that one idea can be expressed in many ways by computing uncertainty at the level of meaning rather than specific sequences of words. Our method works across datasets and tasks without a priori knowledge of the task, requires no task-specific data and robustly generalizes to new tasks not seen before. By detecting when a prompt is likely to produce a confabulation, our method helps users understand when they must take extra care with LLMs and opens up new possibilities for using LLMs that are otherwise prevented by their unreliability.
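The abstract describes computing entropy over clusters of meaning rather than over specific word sequences. As a rough illustration only, the sketch below implements a discrete variant of that idea under two simplifying assumptions not taken from the paper: the means_the_same helper is a hypothetical placeholder for the bidirectional-entailment check the authors describe, and cluster probabilities are estimated from sample frequencies rather than from the model's sequence likelihoods.

```python
import math


def means_the_same(a: str, b: str) -> bool:
    """Hypothetical stand-in for a bidirectional-entailment check between two
    answers (the paper uses a natural-language-inference model); here we only
    compare normalised strings, for illustration."""
    return a.strip().lower() == b.strip().lower()


def semantic_entropy(answers: list[str]) -> float:
    """Discrete semantic-entropy sketch: group sampled answers into clusters
    of equivalent meaning, estimate each cluster's probability from its
    frequency, and return the entropy over those clusters."""
    clusters: list[list[str]] = []
    for ans in answers:
        for cluster in clusters:
            if means_the_same(ans, cluster[0]):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])
    n = len(answers)
    probs = [len(c) / n for c in clusters]
    return -sum(p * math.log(p) for p in probs)


# Five sampled answers with three distinct meanings -> non-zero entropy;
# identical meanings across all samples would give entropy 0.
samples = ["Paris", "paris", "Lyon", "Paris", "Marseille"]
print(round(semantic_entropy(samples), 3))
```

In this reading, low entropy means the sampled answers consistently express one meaning, while high entropy flags prompts that are likely to produce confabulations.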

Citing Articles

Foundation models in bioinformatics.

Guo F, Guan R, Li Y, Liu Q, Wang X, Yang C. Natl Sci Rev. 2025; 12(4):nwaf028.

PMID: 40078374 PMC: 11900445. DOI: 10.1093/nsr/nwaf028.


Transforming hematological research documentation with large language models: an approach to scientific writing and data analysis.

Yang J, Hwang S. Blood Res. 2025; 60(1):15.

PMID: 40047976 PMC: 11885755. DOI: 10.1007/s44313-025-00062-w.


Artificial intelligence for modelling infectious disease epidemics.

Kraemer M, Tsui J, Chang S, Lytras S, Khurana M, Vanderslott S. Nature. 2025; 638(8051):623-635.

PMID: 39972226 DOI: 10.1038/s41586-024-08564-w.


Learning and actioning general principles of cancer cell drug sensitivity.

Carli F, Di Chiaro P, Morelli M, Arora C, Bisceglia L, De Oliveira Rosa N. Nat Commun. 2025; 16(1):1654.

PMID: 39952993 PMC: 11828915. DOI: 10.1038/s41467-025-56827-5.


The foundational capabilities of large language models in predicting postoperative risks using clinical notes.

Alba C, Xue B, Abraham J, Kannampallil T, Lu C. NPJ Digit Med. 2025; 8(1):95.

PMID: 39934379 PMC: 11814325. DOI: 10.1038/s41746-025-01489-2.


References
1.
Shen Y, Heacock L, Elias J, Hentel K, Reig B, Shih G. ChatGPT and Other Large Language Models Are Double-edged Swords. Radiology. 2023; 307(2):e230163. DOI: 10.1148/radiol.230163.

2.
Berrios G. Confabulations: a conceptual history. J Hist Neurosci. 2001; 7(3):225-41. DOI: 10.1076/jhin.7.3.225.1855.

3.
Tsatsaronis G, Balikas G, Malakasiotis P, Partalas I, Zschunke M, Alvers M . An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics. 2015; 16:138. PMC: 4450488. DOI: 10.1186/s12859-015-0564-6. View