The authors of this new paper show that, while hallucinations will inevitably show up when pre-training an LLM due to the nature of the task (nothing is labeled as invalid), this is not true of post-training. Instead, hallucinations surviving post-training are an artifact of benchmarks that don't penalize wrong answers any more than they penalize not answering a question. Furthermore, post-training often makes the problem worse.
They say that adding new benchmarks won't help. The scoring on existing benchmarks needs to be changed so that a wrong guess is penalized more heavily than abstaining.
There's a self-interested aspect to this because OpenAI recently released GPT-5, which does better at avoiding hallucinations. If the benchmarks are changed, GPT-5 will get a higher score on leaderboards. But it seems like it's still the right thing to do?
The paper is here.
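To make that scoring point concrete, here's a toy sketch (my own made-up numbers and scoring functions, not anything from the paper) of how a model that always guesses can outrank a more careful one under today's binary grading, but not once wrong answers cost more than abstaining:

    # Toy illustration (not from the paper): two models face 1000 questions
    # they don't actually know. The "guesser" always answers and is right by
    # luck 2% of the time; the "careful" model abstains on all of them.

    def binary_score(right, wrong, abstain):
        # Typical benchmark today: a wrong answer and an abstention both get 0.
        return right * 1 + wrong * 0 + abstain * 0

    def abstention_aware_score(right, wrong, abstain, penalty=1.0):
        # Proposed fix: confident errors cost more than saying "I don't know".
        return right * 1 - wrong * penalty + abstain * 0

    N = 1000
    lucky = int(N * 0.02)          # guesser's lucky hits
    guesser = dict(right=lucky, wrong=N - lucky, abstain=0)
    careful = dict(right=0, wrong=0, abstain=N)

    print("binary:    guesser =", binary_score(**guesser),
          " careful =", binary_score(**careful))
    print("penalized: guesser =", abstention_aware_score(**guesser),
          " careful =", abstention_aware_score(**careful))
    # binary:    guesser = 20    careful = 0  -> guessing wins the leaderboard
    # penalized: guesser = -960  careful = 0  -> abstaining wins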
Instead, it's an artifact of benchmarks that don't penalize wrong answers more than not answering a question.
Ok but how can an LLM determine if a statement is true? They already output the most statistically plausible answers; from their "point of view" all their responses are the truth.
Technically, all answers are hallucinations. They're no different from any other output; "hallucination" is just the name we give to answers that we know to be wrong. Maybe LLMs could add a "percentage of confidence" to their answers, instead of confidently spewing lies? Sure, it will make them sound like robots, but that's what they are, and it wouldn't be a bad thing to remind users.
If done correctly, it wouldn't even require the LLM to output a raw percent confidence. After all, people do this all the time. "I'm not exactly sure, but I think...," or "I feel like maybe that's how that works," or "I'm like 75% certain that's the answer."
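The mapping from a raw confidence number to that kind of hedged phrasing could also live entirely outside the model. A rough sketch, with thresholds and wording I just made up:

    def hedge(answer: str, confidence: float) -> str:
        # Wrap an answer in natural-sounding uncertainty language instead of
        # printing a bare percentage. Thresholds here are arbitrary.
        if confidence >= 0.9:
            return answer
        if confidence >= 0.7:
            return f"I'm fairly sure {answer}"
        if confidence >= 0.4:
            return f"I'm not certain, but I think {answer}"
        return f"I really don't know; my best guess is {answer}"

    print(hedge("the capital of Australia is Canberra.", 0.95))
    print(hedge("that's how TCP slow start works.", 0.55))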
No one likes an all-confident know-it-all. Especially one that doesn't actually know what they're talking about. Yet that's exactly how LLMs behave. And people eat it up without question. Me included, at times. Kinda weird now that I think about it.
People don't like know-it-alls not because they don't like knowledge but because they don't like being talked down to.
Starting off a sentence with "I think you're on the right path but, in my experience, I think..." primes people to accept what you're saying - the "Yes, and" instead of the "No, but".
LLMs always start by buttering you up and then blowing smoke.
I have a permanent CoPilot prompt to stfu with the compliments and brownnosing. It ignores it like half the time.
LLMs learn from feedback during training. The authors recommend that during post-training, exam questions should allow an "I don't know" response and that the instructions for the test should explicitly state how much a wrong answer is penalized. It would at least stop training LLMs to guess at answers when the probability of guessing right is low.
They admit that classifying responses as right, wrong, or “I don’t know” is still a simplification, but it’s better than what most benchmarks do now. Further improvements are left for someone else to research. They do point out that outputting a degree of confidence might result in unhelpful answers, like “I’m 1/365 certain that Kalai’s birthday is March 7th.”
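If I'm reading the recommendation right, once the test instructions state the penalty, guessing versus saying "I don't know" becomes a simple expected-value threshold. A sketch of that arithmetic (my formulation, not lifted from the paper):

    def expected_score(p_correct: float, wrong_penalty: float) -> float:
        # Expected points from answering when you're only p_correct confident,
        # given a wrong answer costs wrong_penalty points and "I don't know"
        # scores 0.
        return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty

    def should_answer(p_correct: float, wrong_penalty: float) -> bool:
        # Guess only when the expected score beats the 0 you get for abstaining,
        # i.e. when p_correct > wrong_penalty / (1 + wrong_penalty).
        return expected_score(p_correct, wrong_penalty) > 0.0

    # With no penalty (today's benchmarks), any nonzero confidence favors guessing.
    print(should_answer(1 / 365, wrong_penalty=0.0))   # True
    # With a stated penalty of 2 points, you need > 2/3 confidence to guess.
    print(should_answer(0.5, wrong_penalty=2.0))       # False
    print(should_answer(0.8, wrong_penalty=2.0))       # True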
(You’re describing the pre-training task which is predicting the next token, but post-training involves higher-level considerations.)
The article does not explain what comes after that pre-training.
There is a straightforward fix. Penalize confident errors more than you penalize uncertainty
Sure, water is wet, I guess, but how does the LLM determine what is an error? Maybe there's a way (I'm no LLM expert), but there's nothing of substance here.
That is a whole paper from OpenAI that says literally nothing new. There's zero technical insight; it just rehashes what we (people likely to read posts from OpenAI) already know, and it reads exactly like a bland answer straight out of ChatGPT. A lot of words to say nothing, with "conclusions" that could have come from 2022.
The article does not explain what comes after that pre-training.
Yes, it assumes the reader is already familiar with it. I'm not a machine learning expert, but I believe there is instruction tuning (which is how it learns to follow directions instead of just autocompleting) and RLHF (which stands for Reinforcement Learning from Human Feedback). And probably other things by now.
If you're not familiar enough with the field to even understand the terminology, how can you be sure it's "a lot of words to say nothing"?
Because it boils down to 2 points:
1. What hallucinations are and what causes them. I think everyone with even a minimal interest in LLMs in 2025 already knows what they are.
2. How to reduce them. And there's no actual answer here.
I got nothing out of this article. It's written for people who don't know what hallucinations are, so even if I were more familiar with the field, I'd still get nothing.
OpenAI's playground used to have an option to colour-code the response according to the logprobs for the selected tokens. It was pretty interesting to explore. They removed that for some reason, but there are still options to get at that data without a ton of effort, like https://github.com/hobson/problog .
There are definitely times when the probability distribution being sampled from is highly ambiguous -- I've seen cases where a yes-or-no question had about equal chances of selecting yes or no. A lot of the time, though, the measurable ambiguity is just in the phrasing. I was looking at a case recently where the model was mostly hung up on deciding whether to say "died", "passed away", or "committed suicide", when the actual underlying problem was that it was conflating two of Oppenheimer's children: it was trying to tell me that Peter Oppenheimer was dead when in reality he is still living and it was his sister Katherine who committed suicide. According to the probabilities, it was entirely convinced in the moment that Peter was dead.
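If you want to poke at this yourself, the API still returns per-token logprobs, so you can do the colour-coding (or just eyeball the alternatives) on your own. A rough sketch with the OpenAI Python SDK, assuming the Chat Completions endpoint still exposes logprobs the way I remember; the model name is just a placeholder:

    import math
    from openai import OpenAI  # pip install openai

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any chat model that returns logprobs
        messages=[{"role": "user",
                   "content": "Is Peter Oppenheimer still alive? One sentence."}],
        logprobs=True,
        top_logprobs=5,
    )

    # Print each generated token, its probability, and the top alternatives.
    for tok in resp.choices[0].logprobs.content:
        p = math.exp(tok.logprob)  # convert log-probability to probability
        alts = ", ".join(f"{alt.token!r}:{math.exp(alt.logprob):.2f}"
                         for alt in tok.top_logprobs)
        print(f"{tok.token!r:>15}  p={p:.2f}  alternatives: {alts}")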
A while ago I told ChatGPT to start each answer with its confidence level.
I haven't decided if it's actually helpful or accurate yet. It has never given a confidence rating below 50%. But since GPT-5 I can't remember seeing an obvious hallucination yet either.
The problem, I think, is that when you ask it to rate its own answers, it will hallucinate that value just as much as the rest of the answer. The confidence rating would have to be metadata.
I think that might be okay as long as it thinks through its answer before stating its confidence level. Just in case, I'd ask it to put it last.
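One crude way to make confidence actual metadata rather than more generated text is to derive it from the token logprobs the API already returns instead of asking the model to rate itself. A sketch; the aggregation is arbitrary and definitely not a calibrated probability:

    import math

    def answer_confidence(token_logprobs: list[float]) -> float:
        # Geometric mean of the per-token probabilities: a rough "how surprised
        # was the model by its own answer" number, attached as metadata instead
        # of being generated as text. Not calibrated, just a signal.
        if not token_logprobs:
            return 0.0
        avg_logprob = sum(token_logprobs) / len(token_logprobs)
        return math.exp(avg_logprob)

    # Example with made-up logprobs for a short answer:
    logprobs = [-0.05, -0.30, -1.20, -0.10]
    print(f"confidence ~= {answer_confidence(logprobs):.2f}")  # ~0.66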
I want to preface by hedging that it’s been quite some time since I’ve engaged in math research, and some of the concepts fly over my head.
I think the authors’ hypothesis is a reasonable one, so I wanted to skim through the paper to see what new evidence they had to arrive at this conclusion. I was very disappointed by the actual text of the paper, since it seems like nothing more than a well-defined hypothesis.
Mathematically, it seemed like the paper made a lot of assumptions about how LLMs and the training process worked in order to arrive at its conclusion, without a whole lot of evidence or citations showing why those assumptions were valid. I won’t get into the weeds since I’m probably wrong about the details and I think I can make my argument without those details.
See, modern ML research is a lot more like science than math. Making unjustified assumptions about underlying functions and distributions is how most research is done; then, you design and perform experiments to try and develop something meaningful regardless of your assumptions. The most surprising thing about this paper was the lack of any meaningful experiments or data to justify its conclusions. Especially surprising considering that OpenAI definitely has the resources to run experiments on models.
That brings me to the kicker: if OpenAI had actually figured out a methodology to reduce hallucinations, it would be a trade secret and would never have been published. It's possible that's what happened. Perhaps this paper originally made progress on hallucinations but was censored for business reasons. It could have been published anyway for internal political reasons, as a signal to investors, or for public attention.
It’s also possible that they know their hypothesis is wrong from experiments, and they are publishing anyway to throw competing model developers off their scent and give them a red herring. This is the problem with profit-driven businesses being responsible for research. There are already so many perverse incentives about reputation in academia, and if you introduce the monetary and political incentives of business, it becomes so hard to tell if anything published is legitimate.
I think that's a bit too cynical. Yes, they aren't revealing their trade secrets, but they are bragging that GPT-5 does better on hallucinations (on the landing page, not in the paper) and they're advocating for better benchmarks that show that, which both helps them and improves incentives for everyone.
I skipped the math, but it seems to me that the recommended changes are pretty transparent and make sense conceptually. Rewarding competing AI teams for creating models that don't guess at answers seems good? How much proof is really needed?
The crux of it:
As another example, suppose a language model is asked for someone’s birthday but doesn’t know. If it guesses “September 10,” it has a 1-in-365 chance of being right. Saying “I don’t know” guarantees zero points. Over thousands of test questions, the guessing model ends up looking better on scoreboards than a careful model that admits uncertainty.
There is a straightforward fix. Penalize confident errors more than you penalize uncertainty, and give partial credit for appropriate expressions of uncertainty...
In real life, penalties for guessing are context-dependent. In some contexts, like cooking, guessing is harmless: worst case scenario you ruin a meal. In other contexts, like aerospace engineering, accuracy is paramount; lives are at stake.
The conclusions section is interesting:
Finding: Accuracy will never reach 100% because, regardless of model size, search and reasoning capabilities, some real-world questions are inherently unanswerable.
Maybe truth exists only in the pure domains of math and logic: outside of them, truth is merely a lossy, compressed description of our infinitely detailed reality, and the various ways we represent and transmit those descriptions will always leave out detail.