AI Chatbots Will Never Stop Hallucinating

Last summer a federal judge fined a New York City law firm $5,000 after a lawyer used the artificial intelligence tool ChatGPT to draft a brief for a personal injury case. The text was full of falsehoods—including more than six entirely fabricated past cases meant to establish precedent for the personal injury suit. Similar errors are rampant across AI-generated legal outputs, researchers at Stanford University and Yale University found in a recent preprint study of three popular large language models (LLMs). There’s a term for when generative AI models produce responses that don’t match reality: “hallucination.”

Hallucination is usually framed as a technical problem with AI—one that hardworking developers will eventually solve. But many machine-learning experts don’t view hallucination as fixable because it stems from LLMs doing exactly what they were developed and trained to do: respond, however they can, to user prompts. The real problem, according to some AI researchers, lies in our collective ideas about what these models are and how we’ve decided to use them. To mitigate hallucinations, the researchers say, generative AI tools must be paired with fact-checking systems that leave no chatbot unsupervised.

Many conflicts related to AI hallucinations have roots in marketing and hype. Tech companies have portrayed their LLMs as digital Swiss Army knives, capable of solving myriad problems or replacing human work. But applied in the wrong setting, these tools simply fail. Chatbots have offered users incorrect and potentially harmful medical advice, media outlets have published AI-generated articles that included inaccurate financial guidance, and search engines with AI interfaces have invented fake citations. As more people and businesses rely on chatbots for factual information, their tendency to make things up becomes even more apparent and disruptive.

But today’s LLMs were never designed to be purely accurate. They were created to create—to generate—says Subbarao Kambhampati, a computer science professor who researches artificial intelligence at Arizona State University. “The reality is: there’s no way to guarantee the factuality of what is generated,” he explains, adding that all computer-generated “creativity is hallucination, to some extent.”

In a preprint study released in January, three machine-learning researchers at the National University of Singapore presented a proof that hallucination is inevitable in large language models. The proof applies some classic results in learning theory, such as Cantor’s diagonalization argument, to demonstrate that LLMs simply cannot learn all computable functions. In other words, it shows that there will always be solvable problems beyond a model’s abilities. “For any LLM, there is a part of the real world that it cannot learn, where it will inevitably hallucinate,” wrote study co-authors Ziwei Xu, Sanjay Jain and Mohan Kankanhalli in a joint e-mail to Scientific American.

Although the proof appears to be accurate, Kambhampati says, the argument it makes—that certain difficult problems will always stump computers—is too broad to provide much insight into why specific confabulations happen. And, he continues, the issue is more widespread than the proof shows because LLMs hallucinate even when faced with simple requests.

One main reason AI chatbots routinely hallucinate stems from their fundamental construction, says Dilek Hakkani-Tür, a computer science professor who studies natural language and speech processing at the University of Illinois at Urbana-Champaign. LLMs are basically hyperadvanced autocomplete tools; they are trained to predict what should come next in a sequence such as a string of text. If a model’s training data include lots of information on a certain subject, it might produce accurate outputs. But LLMs are built to always produce an answer, even on topics that don’t appear in their training data. Hakkani-Tür says this increases the chance errors will emerge.
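The autocomplete idea can be made concrete with a toy example. The sketch below is not how any real LLM is built; it is a tiny word-pair counter, with a made-up training sentence and hypothetical function names (`train_bigram`, `predict_next`), meant only to show the behavior Hakkani-Tür describes: the predictor returns a next word even for input it has never seen.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count which word follows which in the training text."""
    words = text.lower().split()
    follows = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1
    # Also keep overall word counts for use as a fallback.
    return follows, Counter(words)


def predict_next(model, word):
    """Return the most likely next word; this toy never declines to answer."""
    follows, overall = model
    counts = follows.get(word.lower())
    if counts:
        # Seen in training: pick the most frequent follower.
        return counts.most_common(1)[0][0]
    # Never seen in training: guess the most common word overall.
    return overall.most_common(1)[0][0]


# Illustrative training text (an assumption, not real training data).
model = train_bigram("the court cited the case and the court ruled")
seen = predict_next(model, "the")      # backed by training counts
unseen = predict_next(model, "zebra")  # a guess with nothing behind it
```

Both calls return a word with equal confidence, and nothing in the output marks the second one as a guess.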

Adding more factually grounded training data might seem like an obvious solution. But there are practical and physical limits to how much information an LLM can hold, says computer scientist Amr Awadallah, co-founder and CEO of the AI platform Vectara, which tracks hallucination rates among LLMs on a leaderboard. (The lowest hallucination rates among tracked AI models are around 3 to 5 percent.) To achieve their language fluency, these massive models are trained on orders of magnitude more data than they can store—and data compression is the inevitable result. When LLMs cannot “recall everything exactly like it was in their training, they make up stuff and fill in the blanks,” Awadallah says. And, he adds, these models already operate at the edge of our computing capacity; trying to avoid hallucinations by making LLMs larger would produce slower models that are more expensive and more environmentally harmful to operate.

Another cause of hallucination is calibration, says Santosh Vempala, a computer science professor at the Georgia Institute of Technology. Calibration is the process by which LLMs are adjusted to favor certain outputs over others (to match the statistics of training data or to generate more realistically human-sounding phrases).* In a preprint paper first released last November, Vempala and a co-author suggest that any calibrated language model will hallucinate—because accuracy itself is sometimes at odds with text that flows naturally and seems original. Reducing calibration can boost factuality while simultaneously introducing other flaws in LLM-generated text. Uncalibrated models might write formulaically, repeating words and phrases more often than a person would, Vempala says. The problem is that users expect AI chatbots to be both factual and fluid.

Accepting that LLMs may never be able to produce completely accurate outputs means reconsidering when, where and how we deploy these generative tools, Kambhampati says. They are wonderful idea generators, he adds, but they are not independent problem solvers. “You can leverage them by putting them into an architecture with verifiers,” he explains—whether that means putting more humans in the loop or using other automated programs.
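A minimal sketch of that kind of architecture appears below, under loud assumptions: the generator is a hypothetical stand-in that returns canned sentences rather than calling a real model, and the verifier is a simple lookup against a small hand-written set of trusted statements. None of these names come from Kambhampati or from any real library.

```python
# Hypothetical trusted source; a real system might use a database or a human reviewer.
TRUSTED_FACTS = {"Canberra is the capital of Australia."}


def fake_generator(question):
    """Stand-in for an LLM: returns fluent candidate answers, some of them wrong."""
    return [
        "Sydney is the capital of Australia.",
        "Canberra is the capital of Australia.",
    ]


def verify(claim):
    """Accept a claim only if the trusted source contains it."""
    return claim in TRUSTED_FACTS


def answer_with_verifier(question, generator=fake_generator, verifier=verify):
    """Return the first candidate that passes verification, or None if none do."""
    for candidate in generator(question):
        if verifier(candidate):
            return candidate
    return None  # abstain rather than pass along an unverified answer


checked = answer_with_verifier("What is the capital of Australia?")
abstained = answer_with_verifier(
    "What is the capital of Australia?",
    generator=lambda q: ["Melbourne is the capital of Australia."],
)
```

The design choice worth noting is the `None` branch: unlike the generator on its own, the combined system is allowed to say nothing when no candidate can be confirmed.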

At Vectara, Awadallah is working on exactly that. His team’s leaderboard project is an early proof of concept for a hallucination detector—and detecting hallucinations is the first step to being able to fix them, he says. A future detector might be paired with an automated AI editor that corrects errors before they reach an end user. His company is also working on a hybrid chatbot and news database called AskNews, which combines an LLM with a retrieval engine that picks the most relevant facts from recently published articles to answer a user’s question. Awadallah says AskNews provides descriptions of current events that are significantly more accurate than what an LLM alone could produce because the chatbot bases its responses only on the sources dredged up by the database search tool.

Hakkani-Tür, too, is researching factually grounded systems that pair specialized language models with relatively reliable information sources such as corporate documents, verified product reviews, medical literature or Wikipedia posts to boost accuracy. She hopes that—once all the kinks are ironed out—these grounded networks could one day be useful tools for things like health access and educational equity. “I do see the strength of language models as tools for making our lives better, more productive and more fair,” she says.

In a future where specialized systems verify LLM outputs, AI tools designed for specific contexts would partially replace today’s all-purpose models. Each application of an AI text generator (be it a customer service chatbot, a news summary service or even a legal adviser) would be part of a custom-built architecture that would enable its utility. Meanwhile, less-grounded generalist chatbots would be able to respond to anything you ask but with no guarantee of truth. They would continue to be powerful creative partners or sources of inspiration and entertainment—yet not oracles or encyclopedias—exactly as designed.

*Editor’s Note (4/5/24): This sentence was edited after posting. It previously stated that mitigating bias in a large language model’s output is an example of calibration. That is instead a separate process known as alignment.

Lauren Leffer