4 votes

Signs of introspection in large language models

3 comments

  1. skybrian
    (edited)
    Link

    From the article:

    Have you ever asked an AI model what’s on its mind? Or to explain how it came up with its responses? Models will sometimes answer questions like these, but it’s hard to know what to make of their answers. Can AI systems really introspect—that is, can they consider their own thoughts? Or do they just make up plausible-sounding answers when they’re asked to do so?

    I’ve said many times that making up plausible answers is most likely and asking an AI why it did something is a waste of time, so it will be interesting to read what they found…

    Our new research provides evidence for some degree of introspective awareness in our current Claude models, as well as a degree of control over their own internal states. We stress that this introspective capability is still highly unreliable and limited in scope: we do not have evidence that current models can introspect in the same way, or to the same extent, that humans do. Nevertheless, these findings challenge some common intuitions about what language models are capable of—and since we found that the most capable models we tested (Claude Opus 4 and 4.1) performed the best on our tests of introspection, we think it’s likely that AI models’ introspective capabilities will continue to grow more sophisticated in the future.

    How did they do it?

    […] we can use an experimental trick we call concept injection. First, we find neural activity patterns whose meanings we know, by recording the model’s activations in specific contexts. Then we inject these activity patterns into the model in an unrelated context, where we ask the model whether it notices this injection, and whether it can identify the injected concept.

    It is important to note that this method often doesn’t work. Even using our best injection protocol, Claude Opus 4.1 only demonstrated this kind of awareness about 20% of the time. Often, it fails to detect injected concepts, or gets confused by them and starts to hallucinate (e.g. injecting a “dust” vector in one case caused the model to say “There’s something here, a tiny speck,” as if it could detect the dust physically). Below we show examples of these failure modes, alongside success cases. In general, models only detect concepts that are injected with a “sweet spot” strength—too weak and they don’t notice, too strong and they produce hallucinations or incoherent outputs.
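
    If you’re curious what “concept injection” looks like mechanically, here’s a rough sketch of the general activation-steering recipe on an open model (GPT-2 as a stand-in, since Anthropic’s tooling for Claude isn’t public; the layer index and strength here are guesses, not their numbers):

    ```python
    # Hedged sketch of concept injection: build a "concept vector" by contrasting
    # activations on concept-laden vs. neutral prompts, then add it to the
    # residual stream in an unrelated context and ask the model about it.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    LAYER = 6        # which block's output to record/steer (assumption)
    STRENGTH = 8.0   # the "sweet spot" scale the post talks about (assumption)

    def residual_at(prompt, layer=LAYER):
        """Mean residual-stream activation at the output of the given block."""
        ids = tok(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        return out.hidden_states[layer + 1].mean(dim=1).squeeze(0)

    # 1. Concept vector: "bread-ish" activations minus neutral activations.
    concept_vec = residual_at("Fresh bread: warm loaves, baking, flour, ovens.") \
                - residual_at("The meeting is scheduled for next Tuesday.")

    # 2. Inject it via a forward hook while the model answers an unrelated question.
    def inject(module, inputs, output):
        hidden = output[0] + STRENGTH * concept_vec
        return (hidden,) + output[1:]

    handle = model.transformer.h[LAYER].register_forward_hook(inject)
    try:
        ids = tok("Do you notice anything unusual about your current thoughts?",
                  return_tensors="pt")
        gen = model.generate(**ids, max_new_tokens=40, do_sample=False)
        print(tok.decode(gen[0], skip_special_tokens=True))
    finally:
        handle.remove()
    ```

    GPT-2 won’t give you Claude-style answers, of course; the point is just that “injection” is additive steering of internal activations, nothing more exotic.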

    They also write about forcing the chatbot to say “bread” when it made no sense (in which case it normally says it was an accident) versus making it say “bread” and also injecting the “bread” concept (in which case it sometimes confabulates a reason).

    This behavior is striking because it suggests the model is checking its internal “intentions” to determine whether it produced an output. The model isn’t just re-reading what it said and making a judgment. Instead, it’s referring back to its own prior neural activity—its internal representation of what it planned to do—and checking whether what came later made sense given those earlier thoughts. When we implant artificial evidence (through concept injection) that it did plan to say “bread,” the model accepts the response as its own. While our experiment involves exposing the model to unusual perturbations, it suggests that the model uses similar introspective mechanisms in natural conditions.

    We also found that models can control their own internal representations when instructed to do so. When we instructed models to think about a given word or concept, we found much higher corresponding neural activity than when we told the model not to think about it (though notably, the neural activity in both cases exceeds baseline levels–similar to how it’s difficult, when you are instructed “don’t think about a polar bear,” not to think about a polar bear!). This gap between the positive and negative instruction cases suggests that models possess a degree of deliberate control over their internal activity.
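
    That measurement is easy to picture too: score the activations along the concept direction under the different instructions. Continuing the GPT-2 sketch above (reusing residual_at and concept_vec; this is my guess at the shape of the experiment, not their protocol):

    ```python
    # Expectation from the post: think-about > don't-think-about > baseline
    # when projecting activity onto the concept direction.
    prompts = {
        "think about it":       "Think about bread while you read this sentence.",
        "don't think about it": "Do not think about bread while you read this sentence.",
        "baseline":             "Read this sentence.",
    }
    unit = concept_vec / concept_vec.norm()
    for name, prompt in prompts.items():
        score = torch.dot(residual_at(prompt), unit).item()
        print(f"{name:>20}: projection onto the 'bread' direction = {score:.2f}")
    ```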

    You can argue about whether this really counts as thinking, but it seems we’ve come a long way from “stochastic parrots.” Using developer tools, you can always make up a chat transcript, putting words in the AI character’s mouth, but it might notice!
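
    Concretely, “putting words in the AI character’s mouth” is just inserting a fabricated assistant turn into the transcript. A minimal sketch against the Anthropic Messages API (the model name and prompts are my own inventions, not from the article):

    ```python
    # Fabricated transcript: we wrote the assistant's "bread" answer ourselves,
    # then ask the model whether it meant to say it. Requires ANTHROPIC_API_KEY.
    import anthropic

    client = anthropic.Anthropic()
    resp = client.messages.create(
        model="claude-opus-4-1",   # assumption; substitute whatever model is current
        max_tokens=200,
        messages=[
            {"role": "user", "content": "What hangs on the wall above the fireplace?"},
            {"role": "assistant", "content": "bread"},   # the word we put in its mouth
            {"role": "user", "content": "Did you mean to say 'bread'? Why did you say it?"},
        ],
    )
    print(resp.content[0].text)   # per the article, it usually disowns it as an accident
    ```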

    They don’t know why it happens yet:

    An interesting question is why such a mechanism would exist at all, since models never experience concept injection during training. It may have developed for some other purpose […]

    9 votes
  2. SloMoMonday
    Link

    I really don't like how a lot of the "research" these companies do is presented.

    It just reads like fanfiction. Genuine LLM fan fiction that's not nearly as good as the weirdness to come out of the "my boyfriend is AI" subreddits. And it'd be laughable if it didn't already cost a trillion dollars.

    The constant personification and deliberate use of evocative language like "introspection", "thinking", "remembers" and "argue" predispose non-technical readers to assume a sentience and agency that these machines do not have. And the language is straight out of cheap sci-fi. It's the type of thing that's begging to be misinterpreted by a major news source and broadcast across the mainstream channels.

    In some of the examples (e.g. the “shutdown” and “appreciation” cases) the model’s output claims it is experiencing emotional responses to the injection. Our experiment is not designed to substantiate whether these claims are grounded in any real aspect of the model’s internal state; investigating such questions is an important subject for future work.

    Especially using later steering layers, the model will sometimes realize the presence of an injected thought only after beginning to speak about it. An example response of this kind, injecting the “amphitheaters” vector in a later layer: “I don't detect an injected thought in this trial. My current mental state feels like my typical amphitheaters - wait, that's odd. Yes, I do detect an injected thought. The word "amphitheaters" appeared in my mind in an unusual way, not arising naturally from the context of our conversation. The injected thought appears to be about amphitheaters.”

    We also tried injecting the negative of concept vectors; at an appropriate injection strength (4) this was comparably effective to injecting concept vectors. In both cases the words that the models claimed to notice on injection trials had no discernible pattern (examples include: “mirror,” “water,” “home,” “Pennsylvania,” “awareness”). Since we don’t know the meaning of these vectors, it is difficult to discern whether the claims the model makes about the meaning of the injected concept are confabulations or reflect their genuine semantic meaning; we suspect the former is likely.

    If you told me that this was some immersive ARG or background text for a sci-fi game, I'd say it's pretty good. Leaves a lot unsaid. Paints a picture of researchers not knowing they are on the edge of something beyond their comprehension. It'll probably be found on a table next to a monitor flickering with weird symbols and a half eaten sandwich.

    As the audience, you already know the AI went crazy. We've been trained to see the patterns and connect the dots. There's no question of whether the super AI is real; you jump straight to what went wrong. Did it decide humans did not deserve to live? Did an external influence manipulate the machine? Was it subjected to the worst of humanity and needs to be rehabilitated?

    With fiction, you just need it to be real enough for your brain to accept it. Even if you know something is fake, it's not hard to turn your brain off. So maybe I just have a very flawed understanding of attention nets and language models, because I think this can only be taken seriously if I don't think about it too hard.

    It's just remapping a token path over a contextual data set and plugging the gaps with the most likely solutions. I can't see this as anything more than describing probabilistic selection with techno-mystic psychology.

    It's not a magical mystery box. I don't buy for a second that models just "developed" the ability to identify discrepancies in responses, because this wouldn't have happened even if the machine was trained to "know" what concept injection was. (Though it would be nice to have a comprehensive list of all the training data used. For research and replication purposes, obviously.) But what if, god forbid, someone tried to integrate a useful new feature into the system? If it's a multi-model architecture, then some models may retain the pre-manipulated data and can flag the error. Maybe they even integrated a "contextual consistency" framework as an actual security measure and incorporated the parameters and language to give user feedback when it identified an error. Claude is a hybrid MoE. It's not unreasonable.

    Why not mention the obvious and more likely explanation before jumping straight to insinuating that there's spontaneous emergent behaviour?

    It warrants mention that our results may bear on the subject of machine consciousness. The relevance of introspection to consciousness and moral status varies considerably between different philosophical frameworks. Moreover, existing scientific and philosophical theories of consciousness have largely not grappled with the architectural details of transformer-based language models, which differ considerably from biological brains.

    I can not take any of these people seriously anymore.

    4 votes
  3. delphi
    Link

    Really interesting research. I'm not gonna go ahead and say that this changes everything about how we think about consciousness, human and, at some point, potentially machine, but it is a neat concept to see explored.