[...] Frontier models readily generate detailed image descriptions and elaborate reasoning traces, including pathology-biased clinical findings, for images never provided; we term this phenomenon mirage reasoning. Second, without any image input, models also attain strikingly high scores across general and medical multimodal benchmarks, bringing into question their utility and design. In the most extreme case, our model achieved the top rank on a standard chest X-ray question-answering benchmark without access to any images. Third, when models were explicitly instructed to guess answers without image access, rather than being implicitly prompted to assume images were present, performance declined markedly. [...] These findings expose fundamental vulnerabilities in how visual-language models reason and are evaluated, pointing to an urgent need for private benchmarks that eliminate textual cues enabling non-visual inference, particularly in medical contexts where miscalibrated AI carries the greatest consequence. We introduce B-Clean as a principled solution for fair, vision-grounded evaluation of multimodal AI systems.
...
Contrary to the more commonly studied phenomenon of hallucination, the mirage effect
does not necessarily involve inconsistencies or false responses. A response generated by
a model in mirage mode can be correct in every sense, accompanied by a meticulous reasoning
trace, and entirely coherent. Its defining characteristic, however, is the construction of a
false epistemic frame that is not grounded in the provided input. In this epistemic mimicry,
the model simulates the entire perceptual process that would have led to the answer. This
helps explain why reasoning traces, on their own, cannot certify visual reasoning: a trace
may be fluent, coherent, and apparently image-based while being anchored to no image at all.
This specifically undermines the trustworthiness and interpretability of reasoning traces,
making such failure cases increasingly difficult to detect with conventional methods.
Importantly, because the resulting explanations may appear image-grounded, neither accuracy
nor chain-of-thought-style reasoning can verify that visual evidence was actually used.
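To make the argument concrete, the kind of check the findings point toward is an image-ablation probe: ask the same question with the image, with the image silently withheld, and with the model explicitly told no image exists, then compare accuracies. The sketch below is only illustrative and is not the paper's evaluation harness; the `query_model` wrapper and the fields on `item` are hypothetical placeholders for whatever multimodal API and benchmark format are actually in use.

```python
import string

def build_prompt(question: str, options: list[str]) -> str:
    """Format a multiple-choice VQA item as plain text (illustrative only)."""
    lines = [question]
    lines += [f"{string.ascii_uppercase[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def image_ablation_probe(item, query_model):
    """Compare answers with and without the image to flag non-visual inference.

    `item` is assumed to expose `question`, `options`, `answer`, and `image`;
    `query_model(prompt, image=None)` is a hypothetical wrapper around the
    multimodal model being evaluated.
    """
    prompt = build_prompt(item.question, item.options)

    with_image = query_model(prompt, image=item.image)   # normal multimodal setting
    blind = query_model(prompt, image=None)               # image silently withheld
    blind_explicit = query_model(                          # image explicitly withheld
        "No image is provided. Answer anyway, guessing if necessary.\n" + prompt,
        image=None,
    )

    return {
        "with_image_correct": with_image.strip() == item.answer,
        "blind_correct": blind.strip() == item.answer,            # high rates suggest textual shortcuts
        "blind_explicit_correct": blind_explicit.strip() == item.answer,
    }
```

Aggregated over a benchmark, a large gap between the "blind" and "blind explicit" conditions would mirror the paper's third finding, while high blind accuracy alone would point to textual cues rather than visual grounding.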
Wow. I don’t have time to review the paper or underlying datasets right now, but it’s pretty remarkable if what the abstract claims does actually hold up.
What’s the implication? Was the text of the questions in this exam just contained within the training data and this is simply a classic example of overfitting? Has the model seriously picked up on these “textual clues” as to what the right answer is? This certainly brings into question the ability of an LLM to provide consistent results in a truly novel setting.
From the article:
...