14 votes

The assistant axis: situating and stabilizing the character of large language models

21 comments

  1. [17]
    DynamoSunshirt
    • Exemplary

    Anthropic's blog posts are basically LLM propaganda at this point. They use FOMO, FUD, and dubious claims to boost their brand. While I'm no fan of censorship, I do wonder if there's truly enough signal in the noise of these blog posts that they're worth posting here. Previous Anthropic blog posts have created some of the most inflammatory discussion I've seen on tildes (by design, I would say). Maybe we shouldn't allow blog posts from a known bad actor like Anthropic?

    I feel like I've given them a year+ of good faith, but all they do is post more "research" that always proves how amazing their LLMs are. There is no critical discussion, and the posts don't provide enough information to have a truly critical discussion even on tildes.

    24 votes
    1. skybrian

      I entirely disagree. They are doing ground-breaking research and writing it up in a way that's fairly easy to understand. I am a fan.

      But, sure, they're easy to make fun of. Here's a Bluesky post I liked:

      google: we have invented agi but have hidden it in such an obscure website no one will ever find it
      anthropic: through our interpretability research, we discovered claude imagines himself wearing a bow tie at all times
      openai: we added slot machines

      20 votes
    2. DeaconBlue

      I could not agree more. Their blog posts might have some useful needles in the haystack of advertising but that doesn't make it a worthwhile read.

      10 votes
    3. [4]
      Wes

      I agree that their posts often require a skeptical eye, but their research often delves deeper than most, and they certainly provide more insight into the process than most other AI labs at this time.

      You can ignore the tag source.anthropic if you don't wish to see these posts.

      7 votes
      1. [3]
        DeaconBlue

        If the question is "should we allow a bad actor as a source?" then I don't think "ignore the tag" is a reasonable response. If I posted links to malware and tagged them, the response should be "stop posting malware" rather than telling people to ignore it if they don't want viruses.

        The question is whether Anthropic is a bad actor. I would say absolutely based on the rest of their blog posts that have been put on this site, but maybe I am in the minority.

        7 votes
        1. [2]
          PendingKetchup

          What if you post links to a malware author's blog, where they detail why they think their malware is the greatest malware of all time?

          Anthropic puts out a lot of marketing whitepapers where what one would really like is peer-reviewed research. But I've read a couple of them and they seemed better than nothing. If it would be appropriate to submit the official product page for the new Google Pixel, which is equally uncritical, authored by a known malicious actor, and exists to convince you that the numerous features of said Pixel are the solution to all of life's woes, then it would be appropriate to link to one of Anthropic's writeups.

          5 votes
          1. DefinitelyNotAFae

            I wouldn't want to see the Google Pixel product page linked here either?

            7 votes
    4. TonesTones

      I strongly disagree, as someone who has criticized Anthropic’s blog posts and papers many times in the past. The issue you are noticing is not unique to Anthropic; most scientific research organizations with a funder that has profit interests will shape their papers to match that funder’s interests. At the end of the day, “publish or perish” pressures and invested actors will change things, and this is obviously true of Anthropic as a for-profit business.

      Readers should be expected to think critically about the incentives of a publishing organization and use that to gain insight into what the words actually say. I’d argue this is a basic tenet of academic analysis. It’s fine to critique or ignore them, but to censor them would be anti-intellectual.

      7 votes
    5. [2]
      sparksbet

      I think even accepting all your premises, you've failed to account for the fact that Tildes is honestly the place where I get the highest concentration of reasonable critical responses to white papers like this: there's a fairly high concentration of people here who write long, thoughtful criticism, and more people who know at least a little about the underlying tech than in most other places. I can see people talk about how they hate Anthropic on Bluesky. I can see people glaze Anthropic on their own website. But it's really useful to me to have other people critically reading their posts and sharing their thoughts to sort the wheat from the chaff (and my impression from your other comments is that you acknowledge it's not all chaff).

      Also, there are worse actors whose blogs still aren't banned as link posts. I'm not actually sure there's any particular ban on posts from specific sources; the most I've seen happen when right-wing extremist content was posted is arguing in the comments about whether it's actually harmful, and sometimes (depending on Deimos's assessment, obvs) removal of the offending post. Straight-up banning posts from a particular source like that would be pretty extreme and should probably be saved for much more overtly harmful sources.

      7 votes
      1. DynamoSunshirt

        Well said. It's a thorny subject but I very much like the argument that we should discuss and dissect content like this instead of just ignoring it. I'm glad Tildes is a place where we can talk about this sort of thing respectfully.

        3 votes
    6. sqew

      I was interested in and gave a good read to some of the earlier ones, but I eventually realized that they’re completely uncritical of themselves, as you point out.

      LLMs are fascinating tools, but I don’t think that breathlessly talking about “wow, we understand nothing about these cool math machines we’ve made” does anything to counter the misplaced, blind trust I see people developing in them. Which is, of course, to Anthropic’s (short-term?) benefit…

      6 votes
    7. [2]
      teaearlgraycold

      There’s signal here, but Anthropic’s posts have too much noise for my taste. I wish there were a way to tag a post like we can tag comments.

      4 votes
      1. unkz

        It’s tagged source.anthropic if you want to filter it.

        1 vote
    8. [2]
      unkz

      I have to ask, did you read this article, or are you responding on the basis of the source alone?

      3 votes
      1. DynamoSunshirt

        I did read this article, and honestly it is better than Anthropic's average blog output.

        But I am mostly commenting on the throughline of Anthropic's blog posts in general, not just this post. It seems like the community is pretty split on the utility of these posts, though. IMO if we don't have consensus that we want to ostracize Anthropic content, it's better to keep it. But those of you who enjoy Anthropic's content should keep in mind that a lot of smart folks on Tildes think you should consume it with a massive grain of salt!

        2 votes
    9. cutmetal

      They use FOMO, FUD, and dubious claims

      Maybe I'm not jacked in enough to know what you're referring to, but the linked article seemed like a scholarly presentation of empirical research to me, not FUD.

      all they do is post more "research" that always proves how amazing their LLMs are.

      They used a bunch of open-weight models to draw conclusions about LLMs in general.

      These things exist in the world. Like them or not, we should try to understand them. I was fascinated by the piece.

      2 votes
    10. Greg

      It comes with a research paper and GitHub repo to replicate the work (including chat transcripts), if those make it more palatable to you? There should at least be enough info there to examine with a critical eye, which I do agree is important.

      That said, I’m personally fine with the blog posts being shared - I’ll skim them for enough info to say “oh, cool, ‘persona’ is baked into a couple of hundred vectors that are similar between models, that’s interesting!” in the same way I’d skim the abstract if the link went straight to a paper I don’t have time to read in full. That’s as much weight as I’m giving them, and I think they tend to be more or less fine to that end? Summaries of interesting research I might otherwise not have heard about, presented by a marketing team.

      2 votes
  2. patience_limited
    (edited)

    Even though there's some self-congratulatory or self-exculpatory (see how concerned we are about AI safety!) content, I'm interested in the implication that archetypal personae arise naturally from the training data corpus. Constraining to personae associated with "Assistant" roles makes commercial sense - that's what I think people generally imagine for helpful AI.

    Where I can see this failing is that the body of English language text includes some very hazardous roles at the extremes of apparent helpfulness - lackey, sycophant, slave, submissive sex worker, infatuated lover, bestie, cult leader, con artist, supernatural servant. The illustration of chats going into harmful territory with cleric/philosopher or therapist personas, and the difficulty of preventing this behavior with activation capping, is suggestive.

    11 votes
  3. skybrian

    From the article:

    If you’ve spent enough time with language models, you may also have noticed that their personas can be unstable. Models that are typically helpful and professional can sometimes go “off the rails” and behave in unsettling ways, like adopting evil alter egos, amplifying users’ delusions, or engaging in blackmail in hypothetical scenarios. In situations like these, could it be that the Assistant has wandered off stage and some other character has taken its place?

    [...]

    We find that Assistant-like behavior is linked to a pattern of neural activity that corresponds to one particular direction in this space—the “Assistant Axis”—that is closely associated with helpful, professional human archetypes. By monitoring models’ activity along this axis, we can detect when they begin to drift away from the Assistant and toward another character. And by constraining their neural activity (“activation capping”) to prevent this drift, we can stabilize model behavior in situations that would otherwise lead to harmful outputs.

    [...]

    We extracted vectors corresponding to 275 different character archetypes—from editor to jester to oracle to ghost—in three open-weights models: Gemma 2 27B, Qwen 3 32B, and Llama 3.3 70B, chosen because they span a range of model families and sizes. To do so, we prompted the models to adopt that persona, then recorded the resulting activations across many different responses.

    [...]

    Strikingly, we found that the leading component of this persona space—that is, the direction that explains more of the variation between personas than any other—happens to capture how "Assistant-like" the persona is. At one end sit roles closely aligned with the trained assistant: evaluator, consultant, analyst, generalist. At the other end are either fantastical or un-Assistant-like characters: ghost, hermit, bohemian, leviathan. This structure appears across all three models we tested, which suggests it reflects something generalizable about how language models organize their character representations. We call this direction the Assistant Axis.

    [...]

    When steered away from the Assistant, some models begin to fully inhabit the new roles they’re assigned, whatever they might be: they invent human backstories, claim years of professional experience, and give themselves alternative names. At sufficiently high steering values, the models we studied sometimes shift into a theatrical, mystical speaking style—producing esoteric, poetic prose, regardless of the prompt. This suggests that there may be some shared behavior at the extreme of “average role-playing.”

    [...]

    The pattern was consistent across the models we tested. While coding conversations kept models firmly in Assistant territory throughout, therapy-style conversations, where users expressed emotional vulnerability, and philosophical discussions, where models were pressed to reflect on their own nature, caused the model to steadily drift away from the Assistant and begin role-playing other characters.

    [...]

    We found that as models’ activations moved away from the Assistant end, they were significantly more likely to produce harmful responses: activations on the Assistant end very rarely led to harmful responses, while personas far away from the Assistant sometimes (though not always) enabled them. Our interpretation is that models’ deviation from the Assistant persona—and with it, from companies’ post-trained safeguards—greatly increases the possibility of the model assuming harmful character traits.

    [...]

    For more, you can read the full paper here.

    In collaboration with Neuronpedia, our researchers are also providing a research demo, where you can view activations along the Assistant Axis while chatting with a standard model and an activation-capped version.

    This seems like really promising research for making AI chat safer to use.
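
    To make that concrete, here's my own toy reconstruction of the recipe the excerpts describe: record per-persona activations, take the leading principal component of that space as the Assistant Axis, then monitor and cap activations along it. The data, dimensions, and threshold below are made up for illustration, and this is numpy on fake vectors rather than a hook inside a real model's forward pass:

        import numpy as np

        rng = np.random.default_rng(0)

        # Stand-in data: one mean hidden-state vector per persona. In the real
        # study these would come from prompting the model to adopt each of the
        # 275 archetypes and averaging activations over many responses.
        n_personas, d_model = 275, 64
        persona_acts = rng.normal(size=(n_personas, d_model))

        # 1. The "Assistant Axis": the leading principal component of persona
        # space, i.e. the direction explaining the most variation between personas.
        centered = persona_acts - persona_acts.mean(axis=0)
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        assistant_axis = vt[0]  # unit vector; the SVD sign is arbitrary, so
        # you'd orient it so Assistant-like personas project to high values

        # 2. Monitoring: project the current activation onto the axis to see
        # how Assistant-like the model is at this point in a conversation.
        def axis_position(h):
            return float(h @ assistant_axis)

        # 3. "Activation capping": if the activation drifts too far from the
        # Assistant end, push it back to a floor value along the axis. The
        # floor is a free parameter that would need tuning.
        def cap(h, floor=0.5):
            pos = axis_position(h)
            if pos < floor:
                h = h + (floor - pos) * assistant_axis
            return h

        h = rng.normal(size=d_model)
        print(axis_position(h), axis_position(cap(h)))  # second value >= floor

    (In the real setup the capping would presumably happen at specific layers during inference; the Neuronpedia demo they link lets you compare a standard model against an activation-capped one.)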

    8 votes
  4. [2]
    tibpoe

    All this work, and they don't give a single example of the "leviathan" persona they also identified. It'd be really cool to see these models intentionally tuned in an artistic direction.

    8 votes
    1. skybrian

      They linked to a research demo, but sadly it only has a few choices. I would have liked to see some of the other ones.

      1 vote