The authors of this new paper show that, while hallucinations will inevitably show up when pre-training an LLM due to the nature of the task (nothing is labeled as invalid), this is not true of post-training. Instead, hallucinations surviving post-training are an artifact of benchmarks that don't penalize wrong answers any more than they penalize not answering a question. Furthermore, post-training often makes the problem worse.
They say that adding new benchmarks won't help. The scoring on existing benchmarks needs to be changed so that a wrong guess is penalized more heavily than abstaining.
There's a self-interested aspect to this because OpenAI recently released GPT-5, which does better at avoiding hallucinations. If the benchmarks are changed, GPT-5 will get a higher score on leaderboards. But it seems like it's still the right thing to do?
The paper is here.
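To make that scoring point concrete, here's a toy sketch (my own made-up numbers and scoring functions, not anything from the paper) of how a model that always guesses can outrank a more careful one under today's binary grading, but not once wrong answers cost more than abstaining:

    # Toy illustration (not from the paper): two models face 1000 questions
    # they don't actually know. The "guesser" always answers and is right by
    # luck 2% of the time; the "careful" model abstains on all of them.

    def binary_score(right, wrong, abstain):
        # Typical benchmark today: a wrong answer and an abstention both get 0.
        return right * 1 + wrong * 0 + abstain * 0

    def abstention_aware_score(right, wrong, abstain, penalty=1.0):
        # Proposed fix: confident errors cost more than saying "I don't know".
        return right * 1 - wrong * penalty + abstain * 0

    N = 1000
    lucky = int(N * 0.02)          # guesser's lucky hits
    guesser = dict(right=lucky, wrong=N - lucky, abstain=0)
    careful = dict(right=0, wrong=0, abstain=N)

    print("binary:    guesser =", binary_score(**guesser),
          " careful =", binary_score(**careful))
    print("penalized: guesser =", abstention_aware_score(**guesser),
          " careful =", abstention_aware_score(**careful))
    # binary:    guesser = 20    careful = 0  -> guessing wins the leaderboard
    # penalized: guesser = -960  careful = 0  -> abstaining wins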
Instead, it's an artifact of benchmarks that don't penalize wrong answers more than not answering a question.
Ok but how can an LLM determine if a statement is true? They already output the most statistically plausible answers; from their "point of view" all their responses are the truth.
Technically, all answers are hallucinations. They're no different from any other output; "hallucination" is just the name we give to answers that we know to be wrong. Maybe LLMs could add a "percentage of confidence" to their answers, instead of confidently spewing lies? Sure, it will make them sound like robots, but that's what they are, and it wouldn't be a bad thing to remind users.
If done correctly, it wouldn't even require the LLM to output a raw percent confidence. After all, people do this all the time. "I'm not exactly sure, but I think...," or "I feel like maybe that's how that works," or "I'm like 75% certain that's the answer."
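The mapping from a raw confidence number to that kind of hedged phrasing could also live entirely outside the model. A rough sketch, with thresholds and wording I just made up:

    def hedge(answer: str, confidence: float) -> str:
        # Wrap an answer in natural-sounding uncertainty language instead of
        # printing a bare percentage. Thresholds here are arbitrary.
        if confidence >= 0.9:
            return answer
        if confidence >= 0.7:
            return f"I'm fairly sure {answer}"
        if confidence >= 0.4:
            return f"I'm not certain, but I think {answer}"
        return f"I really don't know; my best guess is {answer}"

    print(hedge("the capital of Australia is Canberra.", 0.95))
    print(hedge("that's how TCP slow start works.", 0.55))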
No one likes an all-confident know-it-all. Especially one that doesn't actually know what they're talking about. Yet that's exactly how LLMs behave. And people eat it up without question. Me included, at times. Kinda weird now that I think about it.
People don't like know-it-alls not because they don't like knowledge but because they don't like being talked down to.
Starting off a sentence with "I think you're on the right path but, in my experience, I think..." primes people to accept what you're saying - the "Yes, and" instead of the "No, but".
LLMs always start by buttering you up and then blowing smoke.
I have a permanent CoPilot prompt to stfu with the compliments and brownnosing. It ignores it like half the time.
LLMs learn from feedback during training. The authors recommend that during post-training, exam questions should allow an "I don't know" response and that the instructions for the test should explicitly state how much a wrong answer is penalized. It would at least stop training LLMs to guess at answers when the probability of guessing right is low.
They admit that classifying responses as right, wrong, or “I don’t know” is still a simplification, but it’s better than what most benchmarks do now. Further improvements are left for someone else to research. They do point out that outputting a degree of confidence might result in unhelpful answers, like “I’m 1/365 certain that Kalai’s birthday is March 7th.”
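If I'm reading the recommendation right, once the test instructions state the penalty, guessing versus saying "I don't know" becomes a simple expected-value threshold. A sketch of that arithmetic (my formulation, not lifted from the paper):

    def expected_score(p_correct: float, wrong_penalty: float) -> float:
        # Expected points from answering when you're only p_correct confident,
        # given a wrong answer costs wrong_penalty points and "I don't know"
        # scores 0.
        return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty

    def should_answer(p_correct: float, wrong_penalty: float) -> bool:
        # Guess only when the expected score beats the 0 you get for abstaining,
        # i.e. when p_correct > wrong_penalty / (1 + wrong_penalty).
        return expected_score(p_correct, wrong_penalty) > 0.0

    # With no penalty (today's benchmarks), any nonzero confidence favors guessing.
    print(should_answer(1 / 365, wrong_penalty=0.0))   # True
    # With a stated penalty of 2 points, you need > 2/3 confidence to guess.
    print(should_answer(0.5, wrong_penalty=2.0))       # False
    print(should_answer(0.8, wrong_penalty=2.0))       # True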
(You’re describing the pre-training task which is predicting the next token, but post-training involves higher-level considerations.)
The article does not explain what comes after that pre-training.
There is a straightforward fix. Penalize confident errors more than you penalize uncertainty
Sure, water is wet, I guess, but how does the LLM determine what is an error? Maybe there's a way (I'm no LLM expert), but there's nothing of substance here.
That is a whole paper from OpenAI that says literally nothing new. There's zero technical insight; it just rehashes what we (people likely to read posts from OpenAI) already know, and it reads exactly like a bland answer straight out of ChatGPT. A lot of words to say nothing, with "conclusions" that could have come from 2022.
The article does not explain what comes after that pre-training.
Yes, it assumes the reader is already familiar with it. I'm not a machine learning expert, but I believe there is instruction tuning (which is how it learns to follow directions instead of just autocompleting) and RLHF (which stands for Reinforcement Learning from Human Feedback). And probably other things by now.
If you're not familiar enough with the field to even understand the terminology, how can you be sure it's "a lot of words to say nothing"?
Because it boils down to 2 points:
1. What hallucinations are and what causes them. I think everyone with even a minimal interest in LLMs in 2025 already knows what they are.
2. How to reduce them. And there's no actual answer here.
I got nothing out of this article. It's written for people who don't know what hallucinations are, so even if I were more familiar with the field, I'd still get nothing.
OpenAI's playground used to have an option to colour-code the response according to the logprobs for the selected tokens. It was pretty interesting to explore. They removed that for some reason, but there are still options to get at that data without a ton of effort, like https://github.com/hobson/problog .
There are definitely times when the probability distribution being sampled from is highly ambiguous -- I've seen cases where a yes-or-no question had about equal chances of selecting yes or no. A lot of the time, though, the measurable ambiguity is just in the phrasing. I was looking at a case recently where the model was mostly hung up on deciding whether to say "died", "passed away", or "committed suicide", when the actual underlying problem was that it was conflating two of Oppenheimer's children: it was trying to tell me that Peter Oppenheimer was dead when in reality he is still living and it was his sister Katherine who committed suicide. According to the probabilities, it was entirely convinced in the moment that Peter was dead.
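If you want to poke at this yourself, the API still returns per-token logprobs, so you can do the colour-coding (or just eyeball the alternatives) on your own. A rough sketch with the OpenAI Python SDK, assuming the Chat Completions endpoint still exposes logprobs the way I remember; the model name is just a placeholder:

    import math
    from openai import OpenAI  # pip install openai

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any chat model that returns logprobs
        messages=[{"role": "user",
                   "content": "Is Peter Oppenheimer still alive? One sentence."}],
        logprobs=True,
        top_logprobs=5,
    )

    # Print each generated token, its probability, and the top alternatives.
    for tok in resp.choices[0].logprobs.content:
        p = math.exp(tok.logprob)  # convert log-probability to probability
        alts = ", ".join(f"{alt.token!r}:{math.exp(alt.logprob):.2f}"
                         for alt in tok.top_logprobs)
        print(f"{tok.token!r:>15}  p={p:.2f}  alternatives: {alts}")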
A while ago I told ChatGPT to start each answer with its confidence level.
I haven't decided if it's actually helpful or accurate yet. It has never given a confidence rating below 50%. But since GPT-5 I can't remember seeing an obvious hallucination yet either.
The problem, I think, is that when you ask it to rate its own answers, it will hallucinate that value just as much as the rest of the answer. The confidence rating would have to be metadata.
I think that might be okay as long as it thinks through its answer before stating its confidence level. Just in case, I'd ask it to put it last.
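One crude way to make confidence actual metadata rather than more generated text is to derive it from the token logprobs the API already returns instead of asking the model to rate itself. A sketch; the aggregation is arbitrary and definitely not a calibrated probability:

    import math

    def answer_confidence(token_logprobs: list[float]) -> float:
        # Geometric mean of the per-token probabilities: a rough "how surprised
        # was the model by its own answer" number, attached as metadata instead
        # of being generated as text. Not calibrated, just a signal.
        if not token_logprobs:
            return 0.0
        avg_logprob = sum(token_logprobs) / len(token_logprobs)
        return math.exp(avg_logprob)

    # Example with made-up logprobs for a short answer:
    logprobs = [-0.05, -0.30, -1.20, -0.10]
    print(f"confidence ~= {answer_confidence(logprobs):.2f}")  # ~0.66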
I want to preface by hedging that it’s been quite some time since I’ve engaged in math research, and some of the concepts fly over my head.
I think the authors’ hypothesis is a reasonable one, so I wanted to skim through the paper to see what new evidence they had to arrive at this conclusion. I was very disappointed by the actual text of the paper, since it seems like nothing more than a well-defined hypothesis.
Mathematically, it seemed like the paper made a lot of assumptions about how LLMs and the training process worked in order to arrive at its conclusion, without a whole lot of evidence or citations showing why those assumptions were valid. I won’t get into the weeds since I’m probably wrong about the details and I think I can make my argument without those details.
See, modern ML research is a lot more like science than math. Making unjustified assumptions about underlying functions and distributions is how most research is done; then, you design and perform experiments to try and develop something meaningful regardless of your assumptions. The most surprising thing about this paper was the lack of any meaningful experiments or data to justify its conclusions. Especially surprising considering that OpenAI definitely has the resources to run experiments on models.
That brings me to the kicker: if OpenAI had actually figured out a methodology to reduce hallucinations, it would be a trade secret and would never have been published. It's possible that's what happened. Perhaps this paper originally made progress on hallucinations but was censored for business reasons. It could have been published anyway for internal political reasons, as a signal to investors, or for public attention.
It’s also possible that they know their hypothesis is wrong from experiments, and they are publishing anyway to throw competing model developers off their scent and give them a red herring. This is the problem with profit-driven businesses being responsible for research. There are already so many perverse incentives about reputation in academia, and if you introduce the monetary and political incentives of business, it becomes so hard to tell if anything published is legitimate.
I think that's a bit too cynical. Yes, they aren't revealing their trade secrets, but they are bragging that GPT-5 does better on hallucinations (on the landing page, not in the paper) and they're advocating for better benchmarks that show that, which both helps them and improves incentives for everyone.
I skipped the math, but it seems to me that the recommended changes are pretty transparent and make sense conceptually. Rewarding competing AI teams for creating models that don't guess at answers seems good? How much proof is really needed?
The crux of it:
As another example, suppose a language model is asked for someone’s birthday but doesn’t know. If it guesses “September 10,” it has a 1-in-365 chance of being right. Saying “I don’t know” guarantees zero points. Over thousands of test questions, the guessing model ends up looking better on scoreboards than a careful model that admits uncertainty.
There is a straightforward fix. Penalize confident errors more than you penalize uncertainty, and give partial credit for appropriate expressions of uncertainty...
In real life, penalties for guessing are context-dependent. In some contexts, like cooking, guessing is harmless: worst case scenario you ruin a meal. In other contexts, like aerospace engineering, accuracy is paramount; lives are at stake.
The conclusions section is interesting:
Finding: Accuracy will never reach 100% because, regardless of model size, search and reasoning capabilities, some real-world questions are inherently unanswerable.
Maybe truth exists only in the pure domains of math and logic: outside of them, truth is merely a lossy, compressed description of our infinitely detailed reality, and the various ways we represent and transmit those descriptions will always leave out detail.