4 votes

Signs of introspection in large language models

3 comments

  1. skybrian
    (edited)
    Link

    From the article:

    Have you ever asked an AI model what’s on its mind? Or to explain how it came up with its responses? Models will sometimes answer questions like these, but it’s hard to know what to make of their answers. Can AI systems really introspect—that is, can they consider their own thoughts? Or do they just make up plausible-sounding answers when they’re asked to do so?

    I’ve said many times that making up plausible answers is most likely and asking an AI why it did something is a waste of time, so it will be interesting to read what they found…

    Our new research provides evidence for some degree of introspective awareness in our current Claude models, as well as a degree of control over their own internal states. We stress that this introspective capability is still highly unreliable and limited in scope: we do not have evidence that current models can introspect in the same way, or to the same extent, that humans do. Nevertheless, these findings challenge some common intuitions about what language models are capable of—and since we found that the most capable models we tested (Claude Opus 4 and 4.1) performed the best on our tests of introspection, we think it’s likely that AI models’ introspective capabilities will continue to grow more sophisticated in the future.

    How did they do it?

    […] we can use an experimental trick we call concept injection. First, we find neural activity patterns whose meanings we know, by recording the model’s activations in specific contexts. Then we inject these activity patterns into the model in an unrelated context, where we ask the model whether it notices this injection, and whether it can identify the injected concept.

    It is important to note that this method often doesn’t work. Even using our best injection protocol, Claude Opus 4.1 only demonstrated this kind of awareness about 20% of the time. Often, it fails to detect injected concepts, or gets confused by them and starts to hallucinate (e.g. injecting a “dust” vector in one case caused the model to say “There’s something here, a tiny speck,” as if it could detect the dust physically). Below we show examples of these failure modes, alongside success cases. In general, models only detect concepts that are injected with a “sweet spot” strength—too weak and they don’t notice, too strong and they produce hallucinations or incoherent outputs.
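
    If you’re curious what “concept injection” looks like mechanically, here’s a rough sketch of the general activation-steering recipe on an open model (GPT-2 as a stand-in, since Anthropic’s tooling for Claude isn’t public; the layer index and strength here are guesses, not their numbers):

    ```python
    # Hedged sketch of concept injection: build a "concept vector" by contrasting
    # activations on concept-laden vs. neutral prompts, then add it to the
    # residual stream in an unrelated context and ask the model about it.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    LAYER = 6        # which block's output to record/steer (assumption)
    STRENGTH = 8.0   # the "sweet spot" scale the post talks about (assumption)

    def residual_at(prompt, layer=LAYER):
        """Mean residual-stream activation at the output of the given block."""
        ids = tok(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        return out.hidden_states[layer + 1].mean(dim=1).squeeze(0)

    # 1. Concept vector: "bread-ish" activations minus neutral activations.
    concept_vec = residual_at("Fresh bread: warm loaves, baking, flour, ovens.") \
                - residual_at("The meeting is scheduled for next Tuesday.")

    # 2. Inject it via a forward hook while the model answers an unrelated question.
    def inject(module, inputs, output):
        hidden = output[0] + STRENGTH * concept_vec
        return (hidden,) + output[1:]

    handle = model.transformer.h[LAYER].register_forward_hook(inject)
    try:
        ids = tok("Do you notice anything unusual about your current thoughts?",
                  return_tensors="pt")
        gen = model.generate(**ids, max_new_tokens=40, do_sample=False)
        print(tok.decode(gen[0], skip_special_tokens=True))
    finally:
        handle.remove()
    ```

    GPT-2 won’t give you Claude-style answers, of course; the point is just that “injection” is additive steering of internal activations, nothing more exotic.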

    They also write about forcing the chatbot to say “bread” when it made no sense (in which case it normally says it was an accident) versus making it say “bread” and also injecting the “bread” concept (in which case it sometimes confabulates a reason).

    This behavior is striking because it suggests the model is checking its internal “intentions” to determine whether it produced an output. The model isn’t just re-reading what it said and making a judgment. Instead, it’s referring back to its own prior neural activity—its internal representation of what it planned to do—and checking whether what came later made sense given those earlier thoughts. When we implant artificial evidence (through concept injection) that it did plan to say “bread,” the model accepts the response as its own. While our experiment involves exposing the model to unusual perturbations, it suggests that the model uses similar introspective mechanisms in natural conditions.

    We also found that models can control their own internal representations when instructed to do so. When we instructed models to think about a given word or concept, we found much higher corresponding neural activity than when we told the model not to think about it (though notably, the neural activity in both cases exceeds baseline levels–similar to how it’s difficult, when you are instructed “don’t think about a polar bear,” not to think about a polar bear!). This gap between the positive and negative instruction cases suggests that models possess a degree of deliberate control over their internal activity.
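
    That measurement is easy to picture too: score the activations along the concept direction under the different instructions. Continuing the GPT-2 sketch above (reusing residual_at and concept_vec; this is my guess at the shape of the experiment, not their protocol):

    ```python
    # Expectation from the post: think-about > don't-think-about > baseline
    # when projecting activity onto the concept direction.
    prompts = {
        "think about it":       "Think about bread while you read this sentence.",
        "don't think about it": "Do not think about bread while you read this sentence.",
        "baseline":             "Read this sentence.",
    }
    unit = concept_vec / concept_vec.norm()
    for name, prompt in prompts.items():
        score = torch.dot(residual_at(prompt), unit).item()
        print(f"{name:>20}: projection onto the 'bread' direction = {score:.2f}")
    ```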

    You can argue about whether this really counts as thinking, but it seems we’ve come a long way from “stochastic parrots.” Using developer tools, you can always make up a chat transcript, putting words in the AI character’s mouth, but it might notice!
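
    Concretely, “putting words in the AI character’s mouth” is just inserting a fabricated assistant turn into the transcript. A minimal sketch against the Anthropic Messages API (the model name and prompts are my own inventions, not from the article):

    ```python
    # Fabricated transcript: we wrote the assistant's "bread" answer ourselves,
    # then ask the model whether it meant to say it. Requires ANTHROPIC_API_KEY.
    import anthropic

    client = anthropic.Anthropic()
    resp = client.messages.create(
        model="claude-opus-4-1",   # assumption; substitute whatever model is current
        max_tokens=200,
        messages=[
            {"role": "user", "content": "What hangs on the wall above the fireplace?"},
            {"role": "assistant", "content": "bread"},   # the word we put in its mouth
            {"role": "user", "content": "Did you mean to say 'bread'? Why did you say it?"},
        ],
    )
    print(resp.content[0].text)   # per the article, it usually disowns it as an accident
    ```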

    They don’t know why it happens yet:

    An interesting question is why such a mechanism would exist at all, since models never experience concept injection during training. It may have developed for some other purpose […]

    9 votes
  2. SloMoMonday
    Link

    I really don't like how a lot of the "research" these companies do is presented.

    It just reads like fanfiction. Genuine LLM fan fiction that's not nearly as good as the weirdness to come out of the "my boyfriend is AI" subreddits. And it'd be laughable if it didn't already cost a trillion dollars.

    The constant personification and deliberate use of evocative language like "introspection", "thinking", "remembers" and "argue" predispose non-technical readers to assume a sentience and agency that these machines do not have. And the language is straight out of cheap sci-fi. It's the type of thing that's begging to be misinterpreted by a major news source and broadcast across the mainstream channels.

    In some of the examples (e.g. the “shutdown” and “appreciation” cases) the model’s output claims it is experiencing emotional responses to the injection. Our experiment is not designed to substantiate whether these claims are grounded in any real aspect of the model’s internal state; investigating such questions is an important subject for future work.

    Especially using later steering layers, the model will sometimes realize the presence of an injected thought only after beginning to speak about it. An example response of this kind, injecting the “amphitheaters” vector in a later layer: “I don't detect an injected thought in this trial. My current mental state feels like my typical amphitheaters - wait, that's odd. Yes, I do detect an injected thought. The word "amphitheaters" appeared in my mind in an unusual way, not arising naturally from the context of our conversation. The injected thought appears to be about amphitheaters.”

    We also tried injecting the negative of concept vectors; at an appropriate injection strength (4) this was comparably effective to injecting concept vectors. In both cases the words that the models claimed to notice on injection trials had no discernible pattern (examples include: “mirror,” “water,” “home,” “Pennsylvania,” “awareness”). Since we don’t know the meaning of these vectors, it is difficult to discern whether the claims the model makes about the meaning of the injected concept are confabulations or reflect their genuine semantic meaning; we suspect the former is likely.

    If you told me that this was some immersive ARG or background text for a sci-fi game, I'd say it's pretty good. Leaves a lot unsaid. Paints a picture of researchers not knowing they are on the edge of something beyond their comprehension. It'll probably be found on a table next to a monitor flickering with weird symbols and a half eaten sandwich.

    As the audience, you already know the AI went crazy. We've been trained to see the patterns and connect the dots. There's no question of whether the super AI is real; you jump straight to what went wrong. Did it decide humans did not deserve to live? Did an external influence manipulate the machine? Was it subjected to the worst of humanity and needs to be rehabilitated?

    With fiction, you just need it to be real enough for your brain to accept it. Even if you know something is fake, it's not hard to turn your brain off. So maybe I just have a very flawed understanding of attention nets and language models, because I think this can only be taken seriously if I don't think about it too hard.

    It's just remapping a token path over a contextual data set and plugging the gaps with the most likely solutions. I can't see this as anything more than describing probabilistic selection with techno-mystic psychology.

    It's not a magical mystery box. I don't buy for a second that models just "developed" the ability to identify discrepancies in responses, because this wouldn't have happened even if the machine was trained to "know" what concept injection was. (Though it would be nice to have a comprehensive list of all the training data used. For research and replication purposes, obviously.) But what if, god forbid, someone tried to integrate a useful new feature into the system? If it's a multi-model architecture, then some models may retain the pre-manipulated data and can flag the error. Maybe they even integrated a "contextual consistency" framework as an actual security measure and incorporated the parameters and language to give user feedback when it identified an error. Claude is a hybrid MoE. It's not unreasonable.

    Why not mention the obvious and more likely explanation before jumping straight to insinuating that there's spontaneous emergent behaviour?

    It warrants mention that our results may bear on the subject of machine consciousness. The relevance of introspection to consciousness and moral status varies considerably between different philosophical frameworks. Moreover, existing scientific and philosophical theories of consciousness have largely not grappled with the architectural details of transformer-based language models, which differ considerably from biological brains.

    I can not take any of these people seriously anymore.

    4 votes
  3. delphi
    Link

    Really interesting research. I'm not gonna go ahead and say that this changes everything about how we think about consciousness, human and, at some point, potentially machine, but it is a neat concept to see explored.