28 votes

Is chain-of-thought reasoning of LLMs a mirage? A data distribution lens.

8 comments

  1. [6]
    skybrian
    Link

    I was going to post the abstract, but the tweet is a better summary:

    Is Chain-of-Thought Reasoning of LLMs a Mirage?

    ... Our results reveal that CoT reasoning is a brittle mirage that vanishes when it is pushed beyond training distributions. This work offers a deeper understanding of why and when CoT reasoning fails, emphasizing the ongoing challenge of achieving genuine and generalizable reasoning.

    ... Our findings reveal that CoT reasoning works effectively when applied to in-distribution or near in-distribution data but becomes fragile and prone to failure even under moderate distribution shifts.

    In some cases, LLMs generate fluent yet logically inconsistent reasoning steps. The results suggest that what appears to be structured reasoning can be a mirage, emerging from memorized or interpolated patterns in the training data rather than logical inference.

    ... Together, these findings suggest that LLMs are not principled reasoners but rather sophisticated simulators of reasoning-like text.

    In the paper, they give a simple example of how reasoning fails:

    Consider this straightforward question: “The day the US was established is in a leap year or a normal year?” When prompted with the CoT prefix, the modern LLM Gemini responded: “The United States was established in 1776. 1776 is divisible by 4, but it’s not a century year, so it’s a leap year. Therefore, the day the US was established was in a normal year.” This response exemplifies a concerning pattern: the model correctly recites the leap year rule and articulates intermediate reasoning steps, yet produces a logically inconsistent conclusion (i.e., asserting 1776 is both a leap year and a normal year). Such inconsistencies suggest that there is a distinction between human-like inference and CoT reasoning.
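
    (As a quick sanity check, the standard Gregorian leap-year rule does make 1776 a leap year, so the "normal year" conclusion contradicts the model's own intermediate steps. A minimal sketch of the rule, not taken from the paper:)

    ```python
    def is_leap_year(year: int) -> bool:
        # Gregorian rule: divisible by 4, except century years,
        # unless also divisible by 400.
        return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

    print(is_leap_year(1776))  # True -> 1776 is a leap year
    ```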

    20 votes
    1. [3]
      json
      Link Parent

      generate fluent yet logically inconsistent reasoning steps. The results suggest that what appears to be structured reasoning can be a mirage, emerging from memorized or interpolated patterns

      Same as people tbh.

      12 votes
      1. [2]
        selib
        Link Parent

        Heh yeah I had the same thought. People usually decide by gut feeling and then maybe invent some reasoning later when asked

        1 vote
        1. Diff
          Link Parent
          "Usually", sure, but there's a lot of weight on that word. People usually put a variable amount of effort into things based on a large variety of factors. The difference is that people are...

          "Usually", sure, but there's a lot of weight on that word. People usually put a variable amount of effort into things based on a large variety of factors. The difference is that people are additionally capable of actually constructing logical arguments and extrapolating existing data into new areas. This is showing that LLMs are only capable of interpolating within their training data, and Chain-of-Thought is more about fine tuning the placement of that interpolation rather than any actual logical reasoning.

          8 votes
    2. [2]
      V17
      Link Parent

      This is not surprising, and I think everyone who spent the time to look at the reasoning steps noticed this. But ime reasoning models brought a noticeable improvement in reducing hallucinations regardless. The reasoning is not really reasoning but something else that nonetheless brought an iterative improvement in many types of outputs.

      Maybe there's a way to get similar gains without the “pretense of reasoning”, but maybe it just works like this even though the process of reasoning is something slightly different from what we intended.

      9 votes
      1. skybrian
        Link Parent

        Yeah, I think the paper shows that it does help, but not as much as if it were actually reasoning logically.

        5 votes
  2. [2]
    unkz
    Link

    We fine-tune a GPT-2–style decoder-only Transformer with a vocabulary size of 10,000. The model supports a maximum context length of 256 tokens. The hidden dimension is 32, the number of Transformer layers is 4, and the number of attention heads is 4. Each block includes a GELU-activated feed-forward sublayer with width 4×d_model.

    Having experimented extensively with these types of small models, I'm skeptical that these results can generalize cleanly to more complex models. This is like comparing human brains (major LLMs have 1 trillion+ parameters; the human brain has 100-1000 trillion synapses) with fruit flies (I estimate this model at about 400k parameters; a fruit fly has ~54M synapses).

    The amount of generalization possible in such a small model is highly limited. Anyone who has tried to build tools using GPT-2 models will be aware of this. I like the overall experiment design though, and the math is interesting.
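
    For what it's worth, a rough back-of-the-envelope count from the dimensions quoted above roughly supports that ~400k figure (a sketch assuming tied input/output embeddings and standard GPT-2-style biases and LayerNorms; the exact total shifts with those choices):

    ```python
    # Dimensions quoted from the paper's setup.
    vocab, d_model, n_layers, n_ctx = 10_000, 32, 4, 256
    d_ff = 4 * d_model  # GELU feed-forward width

    embeddings = vocab * d_model + n_ctx * d_model             # token + position embeddings
    attn = 4 * (d_model * d_model + d_model)                   # Q, K, V, output projections (+ biases)
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model) # two feed-forward matrices (+ biases)
    layer_norms = 2 * (2 * d_model)                            # two LayerNorms per block
    per_layer = attn + ffn + layer_norms

    total = embeddings + n_layers * per_layer + 2 * d_model    # + final LayerNorm
    print(f"{total:,}")  # ~379,000 parameters
    ```

    (Untying the output head from the token embedding would add another ~320k, but either way it's tiny next to production-scale models.)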

    6 votes
    1. TwinTurbo
      Link Parent

      This is my intuition, too. I remember from reading about DeepSeek-R1 that a lot of the "reasoning" value came from being able to stuff a really long CoT inside the context window. I bet Gemini Pro, with its 1M context window size, does the same, as you can see from the "thoughts" in the web interface, which sometimes go on for quite a while before the answer begins.

      I'm not saying anything bad about this study, but I think it's important to keep in mind that findings on small LLMs from just a few years ago don't necessarily generalise to the newest, largest models available today.

      1 vote