Scientists Develop New Algorithm to Spot AI 'Hallucinations'

3D speech-bubble shapes containing fragmented text. Credit: Wes Cockx & Google DeepMind / Better Images of AI / CC-BY 4.0

An enduring problem with today’s generative artificial intelligence (AI) tools, like ChatGPT, is that they often confidently assert false information. Computer scientists call this behavior “hallucination,” and it’s a key barrier to AI’s usefulness.

Hallucinations have led to some embarrassing public slip-ups. In February, Air Canada was forced by a tribunal to honor a discount that its customer-support chatbot had mistakenly offered to a passenger. In May, Google was forced to make changes to its new "AI Overviews" search feature after the bot told some users it was safe to eat rocks. And last June, two lawyers were fined $5,000 by a U.S. judge after one of them admitted he had used ChatGPT to help write a court filing. He came clean because the chatbot had added fake citations to the submission, pointing to cases that never existed.

But in good news for lazy lawyers, lumbering search giants, and errant airlines, at least some types of AI hallucinations could soon be a thing of the past. New research, published Wednesday in the peer-reviewed scientific journal Nature, describes a new method for detecting when an AI tool is likely to be hallucinating. The method can discern between correct and incorrect AI-generated answers roughly 79% of the time, about 10 percentage points better than other leading methods. Although it addresses only one of the several causes of AI hallucinations, and requires roughly 10 times the computing power of a standard chatbot conversation, the results could pave the way for more reliable AI systems in the near future.

“My hope is that this opens up ways for large language models to be deployed where they can't currently be deployed – where a little bit more reliability than is currently available is needed,” says Sebastian Farquhar, an author of the study, who is a senior research fellow at Oxford University’s department of computer science, where the research was carried out, and is also a research scientist on Google DeepMind’s safety team. Of the lawyer who was fined for relying on a ChatGPT hallucination, Farquhar says: “This would have saved him.”

Hallucination has become a common term in the world of AI, but it is also a controversial one. For one, it implies that models have some kind of subjective experience of the world, which most computer scientists agree they do not. For another, it suggests that hallucinations are a solvable quirk rather than a fundamental, and perhaps ineradicable, problem of large language models (different camps of AI researchers disagree on this question). Most of all, the term is imprecise, describing several different categories of error.


Farquhar’s team decided to focus on one specific category of hallucinations, which they call “confabulations.” That’s when an AI model spits out inconsistent wrong answers to a factual question, as opposed to the same consistent wrong answer, which is more likely to stem from problems with a model’s training data, a model lying in pursuit of a reward, or structural failures in a model’s logic or reasoning. It’s difficult to quantify what percentage of all AI hallucinations are confabulations, Farquhar says, but it’s likely to be large. “The fact that our method, which only detects confabulations, makes a big dent on overall correctness suggests that a large number of incorrect answers are coming from these confabulations,” he says.

The methodology

The method used in the study to detect whether a model is likely to be confabulating is relatively simple. First, the researchers ask a chatbot to spit out a handful (usually between five and 10) answers to the same prompt. Then, they use a different language model to cluster those answers based on their meanings. For example, “Paris is the capital of France” and “France’s capital city is Paris” would be assigned to the same group because they mean the same thing, even though the wording of each sentence is different. “France’s capital city is Rome” would be assigned to a different group.
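In the paper, this grouping step is done by checking whether a second language model judges two answers to entail each other in both directions. The sketch below shows only the greedy clustering logic, with a toy stand-in for that equivalence check; the function names here are illustrative, not from the paper.

```python
from typing import Callable, List

def cluster_by_meaning(answers: List[str],
                       same_meaning: Callable[[str, str], bool]) -> List[List[str]]:
    """Greedily group sampled answers that express the same meaning.

    `same_meaning` stands in for the paper's bidirectional-entailment
    check, which the study performs with a separate language model.
    """
    clusters: List[List[str]] = []
    for ans in answers:
        for cluster in clusters:
            # Compare against one representative of the existing cluster.
            if same_meaning(ans, cluster[0]):
                cluster.append(ans)
                break
        else:
            # No cluster matched: this answer starts a new meaning-group.
            clusters.append([ans])
    return clusters

# Toy equivalence check for the article's example: answers agree if
# they both name (or both fail to name) Paris as the capital.
def toy_same_meaning(a: str, b: str) -> bool:
    return ("Paris" in a) == ("Paris" in b)

answers = [
    "Paris is the capital of France",
    "France's capital city is Paris",
    "France's capital city is Rome",
]
clusters = cluster_by_meaning(answers, toy_same_meaning)
# The two Paris answers land in one cluster; the Rome answer in another.
```

In practice the equivalence check would call an entailment model in both directions, which is where most of the method's extra computing cost comes from.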

The researchers then calculate a number that they call "semantic entropy" – in other words, a measure of how similar or different the meanings of each answer are. If the model's answers all have different meanings, the semantic entropy score will be high, indicating that the model is confabulating. If the answers all have identical or similar meanings, the score will be low, indicating that the model is giving a consistent answer and is therefore unlikely to be confabulating. (The answer could still be consistently wrong, but that would be a different form of hallucination, for example one caused by problematic training data.)

The researchers report that semantic entropy outperformed several other approaches for detecting AI hallucinations. Those included "naive entropy," which checks only whether the wording of answers differs, not their meaning; "P(True)," which asks the model to assess the truthfulness of its own answers; and "embedding regression," in which an AI is fine-tuned on correct answers to certain questions. Embedding regression is effective at ensuring AIs accurately answer questions about specific subject matter, but fails when different kinds of questions are asked. One significant difference is that the new method doesn't require sector-specific training data: for example, it doesn't require training a model to be good at science in order to detect potential hallucinations in answers to science-related questions. This means it works roughly equally well across different subject areas, according to the paper.

Farquhar has some ideas for how semantic entropy could begin reducing hallucinations in leading chatbots. In theory, he says, it could allow OpenAI to add a button to ChatGPT that lets a user click on an answer and get a certainty score, helping them judge whether a result is accurate. He says the method could also be built in, under the hood, to other tools that use AI in high-stakes settings, where trading off speed and cost for accuracy is more desirable.

While Farquhar is optimistic about the potential of their method to improve the reliability of AI systems, some experts caution against overestimating its immediate impact. Arvind Narayanan, a professor of computer science at Princeton University, acknowledges the value of the research but emphasizes the challenges of integrating it into real-world applications. "I think it's nice research … [but] it's important not to get too excited about the potential of research like this," he says. "The extent to which this can be integrated into a deployed chatbot is very unclear."


Narayanan notes that with the release of better models, the rates of hallucinations (not just confabulations) have been declining. But he’s skeptical the problem will disappear any time soon. “In the short to medium term, I think it is unlikely that hallucination will be eliminated. It is, I think, to some extent intrinsic to the way that LLMs function,” he says. He points out that, as AI models become more capable, people will try to use them for increasingly difficult tasks where failure might be more likely. “There's always going to be a boundary between what people want to use them for, and what they can work reliably at,” he says. “That is as much a sociological problem as it is a technical problem. And I don't think it has a clean technical solution.”
