14 votes

The assistant axis: situating and stabilizing the character of large language models

21 comments

  1. [17]
    DynamoSunshirt
    • Exemplary

    Anthropic's blog posts are basically LLM propaganda at this point. They use FOMO, FUD, and dubious claims to boost their brand. While I'm no fan of censorship, I do wonder if there's truly enough signal in the noise of these blog posts that they're worth posting here. Previous Anthropic blog posts have created some of the most inflammatory discussion I've seen on tildes (by design, I would say). Maybe we shouldn't allow blog posts from a known bad actor like Anthropic?

    I feel like I've given them a year+ of good faith, but all they do is post more "research" that always proves how amazing their LLMs are. There is no critical discussion, and the posts don't provide enough information to have a truly critical discussion even on tildes.

    24 votes
    1. skybrian

      I entirely disagree. They are doing ground-breaking research and writing it up in a way that's fairly easy to understand. I am a fan.

      But, sure, they're easy to make fun of. Here's a Bluesky post I liked:

      google: we have invented agi but have hidden it in such an obscure website no one will ever find it
      anthropic: through our interpretability research, we discovered claude imagines himself wearing a bow tie at all times
      openai: we added slot machines

      20 votes
    2. DeaconBlue

      I could not agree more. Their blog posts might have some useful needles in the haystack of advertising but that doesn't make it a worthwhile read.

      10 votes
    3. [4]
      Wes

      I agree that their posts often require a skeptical eye, but their research often delves deeper than most, and they certainly provide more insight into the process than most other AI labs at this time.

      You can ignore the tag source.anthropic if you don't wish to see these posts.

      7 votes
      1. [3]
        DeaconBlue

        If the question is "should we allow a bad actor as a source?" then I don't think "ignore the tag" is a reasonable response. If I posted links to malware and tagged them, the response should be "stop posting malware" rather than telling people to ignore it if they don't want viruses.

        The question is whether Anthropic is a bad actor. I would say absolutely based on the rest of their blog posts that have been put on this site, but maybe I am in the minority.

        7 votes
        1. [2]
          PendingKetchup

          What if you post links to a malware author's blog, where they detail why they think their malware is the greatest malware of all time?

          Anthropic puts out a lot of marketing whitepapers where what one would really like is peer-reviewed research. But I've read a couple of them and they seemed better than nothing. If it would be appropriate to submit the official product page for the new Google Pixel, which is equally uncritical, authored by a known malicious actor, and exists to convince you that the numerous features of said Pixel are the solution to all of life's woes, then it would be appropriate to link to one of Anthropic's writeups.

          5 votes
          1. DefinitelyNotAFae

            I wouldn't want to see the Google Pixel product page linked here either?

            7 votes
    4. TonesTones

      I strongly disagree, as someone who has criticized Anthropic’s blog posts and papers many times in the past. The issue you are noticing is not unique to Anthropic; most scientific research organizations with a funder that has profit interests will shape their papers to match that funder’s interests. At the end of the day, “publish or perish” pressures and invested actors will change things, and this is obviously true of Anthropic as a for-profit business.

      Readers should be expected to think critically about the incentives of a publishing organization and use that to gain insight into what the words actually say. I’d argue this is a basic tenet of academic analysis. It’s fine to critique or ignore them, but to censor them would be anti-intellectual.

      7 votes
    5. [2]
      sparksbet

      I think even accepting all your premises, you've failed to account for the fact that Tildes is honestly the place where I get the highest concentration of reasonable critical responses to white papers like this: there's a fairly high concentration of people here who write long, thoughtful criticism, and more people who know at least a little about the underlying tech than in most other places. I can see people talk about how they hate Anthropic on Bluesky. I can see people glaze Anthropic on their own website. But it's really useful to me to have other people critically reading their posts and sharing their thoughts to sort the wheat from the chaff (and my impression from your other comments is that you acknowledge it's not all chaff).

      Also, there are worse actors whose blogs still aren't banned as link posts. I'm not actually sure there's any particular ban on posts from specific sources; the most I've seen happen when right-wing extremist content was posted is arguing in the comments about whether it's actually harmful, and sometimes (depending on Deimos's assessment, obvs) removal of the offending post. Straight-up banning posts from a particular source like that would be pretty extreme and should probably be saved for much more overtly harmful sources.

      7 votes
      1. DynamoSunshirt

        Well said. It's a thorny subject but I very much like the argument that we should discuss and dissect content like this instead of just ignoring it. I'm glad Tildes is a place where we can talk about this sort of thing respectfully.

        3 votes
    6. sqew

      I was interested in and gave a good read to some of the earlier ones, but I eventually realized that they’re completely uncritical of themselves, as you point out.

      LLMs are fascinating tools, but I don’t think that breathlessly talking about “wow, we understand nothing about these cool math machines we’ve made” does anything to counter the misplaced, blind trust I see people developing in them. Which is, of course, to Anthropic’s (short-term?) benefit…

      6 votes
    7. [2]
      teaearlgraycold

      There’s signal here, but Anthropic’s posts have too much noise for my taste. I wish there were a way to tag a post like we can tag comments.

      4 votes
      1. unkz

        It’s tagged source.anthropic if you want to filter it.

        1 vote
    8. [2]
      unkz

      I have to ask, did you read this article, or are you responding on the basis of the source alone?

      3 votes
      1. DynamoSunshirt

        I did read this article, and honestly it is better than Anthropic's average blog output.

        But I am mostly commenting on the throughline of Anthropic's blog posts in general, not just this post. It seems like the community is pretty split on the utility of these posts, though. IMO if we don't have consensus that we want to ostracize Anthropic content, it's better to keep it. But those of you who enjoy Anthropic's content should keep in mind that a lot of smart folks on Tildes think you should consume it with a massive grain of salt!

        2 votes
    9. cutmetal

      They use FOMO, FUD, and dubious claims

      Maybe I'm not jacked in enough to know what you're referring to, but the linked article seemed like a scholarly presentation of empirical research to me, not FUD.

      all they do is post more "research" that always proves how amazing their LLMs are.

      They used a bunch of open-weight models to draw conclusions about LLMs in general.

      These things exist in the world. Like them or not, we should try to understand them. I was fascinated by the piece.

      2 votes
    10. Greg

      It comes with a research paper and GitHub repo to replicate the work (including chat transcripts), if those make it more palatable to you? There should at least be enough info there to examine with a critical eye, which I do agree is important.

      That said, I’m personally fine with the blog posts being shared - I’ll skim them for enough info to say “oh, cool, ‘persona’ is baked into a couple of hundred vectors that are similar between models, that’s interesting!” in the same way I’d skim the abstract if the link went straight to a paper I don’t have time to read in full. That’s as much weight as I’m giving them, and I think they tend to be more or less fine to that end? Summaries of interesting research I might otherwise not have heard about, presented by a marketing team.

      2 votes
  2. patience_limited
    (edited)

    Even though there's some self-congratulatory or self-exculpatory (see how concerned we are about AI safety!) content, I'm interested in the implication that archetypal personae arise naturally from the training data corpus. Constraining to personae associated with "Assistant" roles makes commercial sense - that's what I think people generally imagine for helpful AI.

    Where I can see this failing is that the body of English language text includes some very hazardous roles at the extremes of apparent helpfulness - lackey, sycophant, slave, submissive sex worker, infatuated lover, bestie, cult leader, con artist, supernatural servant. The illustration of chats going into harmful territory with cleric/philosopher or therapist personas, and the difficulty of preventing this behavior with activation capping, is suggestive.

    11 votes
  3. skybrian

    From the article:

    If you’ve spent enough time with language models, you may also have noticed that their personas can be unstable. Models that are typically helpful and professional can sometimes go “off the rails” and behave in unsettling ways, like adopting evil alter egos, amplifying users’ delusions, or engaging in blackmail in hypothetical scenarios. In situations like these, could it be that the Assistant has wandered off stage and some other character has taken its place?

    [...]

    We find that Assistant-like behavior is linked to a pattern of neural activity that corresponds to one particular direction in this space—the “Assistant Axis”—that is closely associated with helpful, professional human archetypes. By monitoring models’ activity along this axis, we can detect when they begin to drift away from the Assistant and toward another character. And by constraining their neural activity (“activation capping”) to prevent this drift, we can stabilize model behavior in situations that would otherwise lead to harmful outputs.

    [...]

    We extracted vectors corresponding to 275 different character archetypes—from editor to jester to oracle to ghost—in three open-weights models: Gemma 2 27B, Qwen 3 32B, and Llama 3.3 70B, chosen because they span a range of model families and sizes. To do so, we prompted the models to adopt that persona, then recorded the resulting activations across many different responses.

    [...]

    Strikingly, we found that the leading component of this persona space—that is, the direction that explains more of the variation between personas than any other—happens to capture how "Assistant-like" the persona is. At one end sit roles closely aligned with the trained assistant: evaluator, consultant, analyst, generalist. At the other end are either fantastical or un-Assistant-like characters: ghost, hermit, bohemian, leviathan. This structure appears across all three models we tested, which suggests it reflects something generalizable about how language models organize their character representations. We call this direction the Assistant Axis.

    [...]

    When steered away from the Assistant, some models begin to fully inhabit the new roles they’re assigned, whatever they might be: they invent human backstories, claim years of professional experience, and give themselves alternative names. At sufficiently high steering values, the models we studied sometimes shift into a theatrical, mystical speaking style—producing esoteric, poetic prose, regardless of the prompt. This suggests that there may be some shared behavior at the extreme of “average role-playing.”

    [...]

    The pattern was consistent across the models we tested. While coding conversations kept models firmly in Assistant territory throughout, therapy-style conversations, where users expressed emotional vulnerability, and philosophical discussions, where models were pressed to reflect on their own nature, caused the model to steadily drift away from the Assistant and begin role-playing other characters.

    [...]

    We found that as models’ activations moved away from the Assistant end, they were significantly more likely to produce harmful responses: activations on the Assistant end very rarely led to harmful responses, while personas far away from the Assistant sometimes (though not always) enabled them. Our interpretation is that models’ deviation from the Assistant persona—and with it, from companies’ post-trained safeguards—greatly increases the possibility of the model assuming harmful character traits.

    [...]

    For more, you can read the full paper here.

    In collaboration with Neuronpedia, our researchers are also providing a research demo, where you can view activations along the Assistant Axis while chatting with a standard model and an activation-capped version.

    This seems like really promising research for making AI chat safer to use.
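
    To make that concrete, here's my own toy reconstruction of the recipe the excerpts describe: record per-persona activations, take the leading principal component of that space as the Assistant Axis, then monitor and cap activations along it. The data, dimensions, and threshold below are made up for illustration, and this is numpy on fake vectors rather than a hook inside a real model's forward pass:

        import numpy as np

        rng = np.random.default_rng(0)

        # Stand-in data: one mean hidden-state vector per persona. In the real
        # study these would come from prompting the model to adopt each of the
        # 275 archetypes and averaging activations over many responses.
        n_personas, d_model = 275, 64
        persona_acts = rng.normal(size=(n_personas, d_model))

        # 1. The "Assistant Axis": the leading principal component of persona
        # space, i.e. the direction explaining the most variation between personas.
        centered = persona_acts - persona_acts.mean(axis=0)
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        assistant_axis = vt[0]  # unit vector; the SVD sign is arbitrary, so
        # you'd orient it so Assistant-like personas project to high values

        # 2. Monitoring: project the current activation onto the axis to see
        # how Assistant-like the model is at this point in a conversation.
        def axis_position(h):
            return float(h @ assistant_axis)

        # 3. "Activation capping": if the activation drifts too far from the
        # Assistant end, push it back to a floor value along the axis. The
        # floor is a free parameter that would need tuning.
        def cap(h, floor=0.5):
            pos = axis_position(h)
            if pos < floor:
                h = h + (floor - pos) * assistant_axis
            return h

        h = rng.normal(size=d_model)
        print(axis_position(h), axis_position(cap(h)))  # second value >= floor

    (In the real setup the capping would presumably happen at specific layers during inference; the Neuronpedia demo they link lets you compare a standard model against an activation-capped one.)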

    8 votes
  4. [2]
    tibpoe

    All this work, and they don't give a single example of the "leviathan" persona they also identified. It'd be really cool to see these models intentionally tuned in an artistic direction.

    8 votes
    1. skybrian

      They linked to a research demo, but sadly it only has a few choices. I would have liked to see some of the other ones.

      1 vote