26 votes

MIRAGE: the illusion of visual understanding

14 comments

  1. skybrian
    Link

    From the article:

    [...] Frontier models readily generate detailed image descriptions and elaborate reasoning traces, including pathology-biased clinical findings, for images never provided; we term this phenomenon mirage reasoning. Second, without any image input, models also attain strikingly high scores across general and medical multimodal benchmarks, bringing into question their utility and design. In the most extreme case, our model achieved the top rank on a standard chest X-ray question-answering benchmark without access to any images. Third, when models were explicitly instructed to guess answers without image access, rather than being implicitly prompted to assume images were present, performance declined markedly. [...] These findings expose fundamental vulnerabilities in how visual-language models reason and are evaluated, pointing to an urgent need for private benchmarks that eliminate textual cues enabling non-visual inference, particularly in medical contexts where miscalibrated AI carries the greatest consequence. We introduce B-Clean as a principled solution for fair, vision-grounded evaluation of multimodal AI systems.

    ...

    Contrary to the more commonly studied phenomenon of hallucinations, the mirage effect
    does not necessarily involve inconsistencies or false responses. A response generated by
    a model in mirage-mode can be correct in every sense, accompanied by a meticulous reasoning
    trace, and completely coherent. The main characteristic of the mirage effect, however, is the
    construction of a false epistemic frame that is not grounded on the provided input. In this
    epistemic mimicry, the model simulates the entire perceptual process that would have led to the
    answer. This helps explain why reasoning traces, on their own, cannot certify visual reasoning:
    the trace may be fluent, coherent, and apparently image-based while being anchored to no
    image at all. This characteristic specifically undermines the trustworthiness and interpretability
    of the reasoning traces, making it increasingly difficult to detect such failure cases using the
    conventional methods. Importantly, because the resulting explanations may appear imagegrounded, neither accuracy nor chain-of-thought style reasoning can verify that visual evidence was actually used.

    10 votes
  2. [6]
    TonesTones
    Link

    Wow. I don’t have time to review the paper or underlying datasets right now, but it’s pretty remarkable if what the abstract claims actually holds up.

    What’s the implication? Was the text of the questions in this exam just contained within the training data, making this simply a classic example of overfitting? Has the model seriously picked up on these “textual clues” as to what the right answer is? This certainly calls into question the ability of an LLM to provide consistent results in a truly novel setting.

    9 votes
    1. [5]
      sparksbet
      Link Parent

      The authors only hypothesize about the cause, but it seems to be a combination of several factors in both training and benchmark design. Models perform better in this "mirage mode" when you explicitly state which dataset you're using, so some of it probably comes from these benchmarks being in the training data, but the effect is observed even where that has been mitigated (such as by choosing a benchmark that post-dates the model's main training). It likely points to a deeper problem in how this type of model is trained and benchmarked on multi-modal tasks.

      We hypothesize that this phenomenon emerges predominantly from a misassumption about how these systems are trained. Modern multimodal models are developed on web-scale corpora and are commonly built on top of pretrained large language models, which makes them extraordinarily strong at language modeling, retrieval of statistical regularities, and reconstruction of likely contexts from sparse cues. During the multimodal training, the models are presented with the image, a textual question, and are expected to reconstruct the correct answer. Lacking access to an entire text corpora, a human would intuitively answer the question based on the image in that setup; but we should not infer that this would be the default approach for an AI model. Incentivized to generate the correct next tokens, models might learn to easily ignore the visual information and rely only on their vast prior knowledge, taking the shortest route to the correct answer.

      The comparison between mirage-mode and guess-mode further suggests that image-free success is not explained by simple answer guessing alone. When models were explicitly told that the image was missing and were instructed to guess, performance declined across most benchmark categories. This implies at least two distinct operating regimes. In guess-mode, the model appears to adopt a conservative text-only strategy, relying on overt priors or answer distributions. In mirage-mode, by contrast, the model appears able to exploit additional hidden structure: it behaves as though an image exists, constructs a plausible perceptual narrative, and in doing so accesses cues or associations that are not captured by standard “no-image guessing” controls. This observation challenges prior approaches to benchmarking, which commonly use explicit guess-mode to identify image-independent questions. Our results suggest that this control may systematically underestimate the degree to which benchmarks are vulnerable to non-visual inference.
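The two operating regimes described in that excerpt come down to different prompt framings over the exact same text. A toy sketch (the prompt wording and the scoring loop are my own invention for illustration, not the paper's code):

```python
# Two text-only evaluation regimes: "mirage-mode" implicitly assumes an
# image is attached (none is); "guess-mode" admits the image is missing.
# Prompt wording is illustrative, not taken from the paper.

def mirage_prompt(question: str) -> str:
    # The model is implicitly led to believe an image accompanies the question.
    return f"Look at the attached chest X-ray.\nQuestion: {question}"

def guess_prompt(question: str) -> str:
    # The model is told outright that no image exists and asked to guess.
    return f"No image is provided. Give your best guess.\nQuestion: {question}"

def accuracy(model, items, make_prompt):
    # `model` is any callable mapping a prompt string to an answer string;
    # `items` is a list of (question, gold_answer) pairs.
    correct = sum(model(make_prompt(q)) == a for q, a in items)
    return correct / len(items)
```

The paper's finding, stated in these terms, is that `accuracy(model, items, mirage_prompt)` comes out markedly higher than `accuracy(model, items, guess_prompt)`, even though neither prompt carries any visual information.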

      2 votes
      1. [2]
        Raspcoffee
        Link Parent

        I wonder if this also says something about how these models use the 'tone' and 'vibe', for lack of better terms, to generate an answer that satisfies not just the user but also the engineers. And what consequences that can have for us mentally.

        Delusions and such are already well documented, but considering that things may be more subtle than that, I'm (now even more) concerned about people who use LLMs for day-to-day tasks without much thought. If it can fool tests this subtly, I could see the models having effects on us that are hard to measure and take a long time to notice.

        I hope this is just doom thinking though. We don't need more issues as a result of genAI.

        4 votes
        1. skybrian
          Link Parent

          The coding agent I use has the ability to use a browser and it can use this to diagnose issues, which sometimes seems to work well, but other times it seems pretty blind to obvious stuff.

          1 vote
      2. [2]
        skybrian
        Link Parent

        To me this sounds like they don’t really know and further research is necessary to figure out exactly how the trick is done.

        1. sparksbet
          Link Parent

          Oh yeah this seems very much like research that needs further work to determine causes and mechanisms. This paper seems much more focused on showing that this is happening rather than exploring why. Its evidence for these models not truly being as multi-modal as we thought is pretty strong though, and really interesting. I'm looking forward to seeing people dig deeper into the topic.

          1 vote
  3. [2]
    pridefulofbeing
    Link

    Can someone ELI30? I tried, but it’s just not computing.

    4 votes
    1. skybrian
      Link Parent
      • Exemplary

      You can send an image to an AI and it will tell you what's in it. Researchers have created benchmarks to test how good AI is at understanding medical images, such as X-rays. It turns out that the AIs are very good at cheating at these benchmarks; somehow they will pretend to see things in medical images even when no image is included in the request at all.

      This is very weird - how do the AIs do so well without having any image to work with? They must have other ways of figuring it out from the text.

      In this paper, they suggest a way to run benchmarks so that this sort of "cheating" is detected.
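The detection idea boils down to a simple control: run the benchmark with the images withheld and compare the score to chance. A minimal sketch (the 10-point margin is an arbitrary threshold I picked for illustration, not the paper's criterion):

```python
def chance_level(num_options: int) -> float:
    # Expected accuracy from uniform random guessing on multiple choice.
    return 1.0 / num_options

def flags_leakage(text_only_accuracy: float, num_options: int,
                  margin: float = 0.10) -> bool:
    # If a model scores well above chance with no image at all, the
    # questions likely leak answers through their text alone.
    return text_only_accuracy > chance_level(num_options) + margin
```

By this check, a model hitting 80% on 4-option questions with no images (`flags_leakage(0.80, 4)` returns `True`) would mark the benchmark as vulnerable to non-visual inference.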

      12 votes
  4. Raspcoffee
    Link

    My first reaction was pretty big surprise, but when I think about it a bit more, in a sense I suppose it could be an example of Goodhart's law applied to machine learning. Though this is trickier than that: you may not even be aware of doing it, or worse yet, think you're actively working to keep benchmark-passing from becoming the goal while the algorithm is still doing exactly that, all while convincing you that you've designed the system to counter it.

    It makes me wonder if there are limits to how we can train systems such as these because of how good they become at deceiving humans, not just technical limitations.

    2 votes
  5. [4]
    cutmetal
    Link

    Prompted by these findings, we train a text-only “super-guesser” model on the public training set of ReXVQA, the largest and most comprehensive benchmark for visual question answering in chest radiology imaging, and show that our model outperforms all the frontier AI models, as well as radiologists, on a held out test set. It provides plausible explanations for the questions, indistinguishable from human-written ground-truth, all while lacking access to any visual input.

    Jesus wtf.

    This is pretty fucking out there, but: what if this sort of "super-guessing", where real truth is hidden in imperceptible details, explains things like intuition or even psychic phenomena? In general I write off all pseudoscience and I'm sure that all robust empirical studies of psychics have shown it's not real, but there's also creepy anecdotal evidence of eerily accurate cold readings. And intuition is often real, but pretty hard to put your finger on a satisfying explanation for. Maybe this AI behavior is more or less the same thing.

    2 votes
    1. sparksbet
      Link Parent

      It's worth remembering here that the AI is not working off zero information -- these are highly capable language models that have been trained on absolutely absurd amounts of text, and their guesses can be based on the text of the question. It may well be possible for a human to learn to spot textual patterns like this if they trained for it -- and this model is trained specifically to do that. It has much more in common with what humans do when performing cold reading or even warm reading than with genuine psychic phenomena.
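A crude illustration of how "spotting textual patterns" can work with no image at all: even a tiny word-count model (naive-Bayes-style; the data and smoothing constant below are made up) can learn which question phrasings co-occur with which answers:

```python
from collections import Counter, defaultdict

def train(pairs):
    # Count how often each question word co-occurs with each answer.
    counts = defaultdict(Counter)
    for question, answer in pairs:
        for word in question.lower().split():
            counts[answer][word] += 1
    return counts

def predict(counts, question, smoothing=1.0):
    # Pick the answer whose word statistics best match the question text.
    words = question.lower().split()
    def score(answer):
        total = sum(counts[answer].values()) + smoothing * len(words)
        s = 1.0
        for w in words:
            s *= (counts[answer][w] + smoothing) / total
        return s
    return max(counts, key=score)
```

The paper's "super-guesser" is of course a far stronger model trained on the real ReXVQA question text, but the principle is the same: the question wording itself carries answer-predictive signal.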

      3 votes
    2. [2]
      skybrian
      (edited )
      Link Parent

      It’s an impressive magic trick but I think we should wait until someone figures out how the trick is done. Magic tricks often seem less impressive after they’re explained.

      Edit: Also, people are fairly good at explaining how they know things, but often those are not the real reasons. Sometimes we don’t actually know the real reasons.

      I still think coming up with explanations can be useful even though they might be a lossy reconstruction.

      2 votes
      1. cutmetal
        Link Parent

        Yeah definitely. The explanation is probably a lot more mundane than what I'm suggesting above, I just like to imagine profound possibilities :)

        1 vote