13 votes

Extracting interpretable features from Claude 3 Sonnet

5 comments

  1. skybrian
    (edited )
    Link

    From the article:

    […] we're pleased to report extracting high-quality features from Claude 3 Sonnet, Anthropic's medium-sized production model.

    We find a diversity of highly abstract features. They both respond to and behaviorally cause abstract behaviors. Examples of features we find include features for famous people, features for countries and cities, and features tracking type signatures in code. Many features are multilingual (responding to the same concept across languages) and multimodal (responding to the same concept in both text and images), as well as encompassing both abstract and concrete instantiations of the same idea (such as code with security vulnerabilities, and abstract discussion of security vulnerabilities).

    Some of the features we find are of particular interest because they may be safety-relevant – that is, they are plausibly connected to a range of ways in which modern AI systems may cause harm. In particular, we find features related to security vulnerabilities and backdoors in code; bias (including both overt slurs, and more subtle biases); lying, deception, and power-seeking (including treacherous turns); sycophancy; and dangerous / criminal content (e.g., producing bioweapons). However, we caution not to read too much into the mere existence of such features: there's a difference (for example) between knowing about lies, being capable of lying, and actually lying in the real world. This research is also very preliminary. Further work will be needed to understand the implications of these potentially safety-relevant features.

    For instance, we see that clamping the Golden Gate Bridge feature 34M/31164353 to 10× its maximum activation value induces thematically-related model behavior. In this example, the model starts to self-identify as the Golden Gate Bridge! Similarly, clamping the Transit infrastructure feature 1M/3 to 5× its maximum activation value causes the model to mention a bridge when it otherwise would not.
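
    To make "clamping" concrete: each SAE feature corresponds to a direction in the model's residual stream, and steering amounts to pinning that feature's activation at a chosen multiple of its maximum and writing the change back through the SAE decoder. The sketch below is purely illustrative; the SAE interface, the hook point, and values like MAX_ACT are my assumptions, not Anthropic's actual code.

        import torch

        # Illustrative placeholders; Anthropic's model and SAE internals are not public.
        FEATURE_ID = 31164353      # e.g. the Golden Gate Bridge feature in the 34M SAE
        CLAMP_MULTIPLE = 10.0      # clamp to 10x the feature's maximum observed activation
        MAX_ACT = 5.0              # stand-in for that feature's maximum activation value

        def clamp_feature(residual_stream: torch.Tensor, sae) -> torch.Tensor:
            """Pin one SAE feature to a fixed activation and write the change back."""
            acts = sae.encode(residual_stream)          # (batch, seq, n_features)
            target = CLAMP_MULTIPLE * MAX_ACT           # desired activation value
            delta = target - acts[..., FEATURE_ID]      # how far to move the feature
            # Add the feature's decoder direction, scaled by the change in activation,
            # to the residual stream at the layer where the SAE was trained.
            return residual_stream + delta.unsqueeze(-1) * sae.decoder_weight[FEATURE_ID]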

    A particularly interesting example is an addition feature 1M/697189, which activates on names of functions that add numbers. For example, this feature fires on “bar” when it is defined to perform addition, but not when it is defined to perform multiplication. Moreover, it fires at the end of any function definition that implements addition.
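
    In code terms (my own reconstruction of the kind of snippet the paper shows, not the paper's exact example), the contrast looks like this:

        # The addition feature fires on the name "bar" in a definition like this,
        # where the function implements addition...
        def bar(a, b):
            return a + b

        # ...but not on this alternative definition, where "bar" performs multiplication.
        def bar(a, b):
            return a * b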

    Remarkably, this feature even correctly handles function composition, activating in response to functions that call other functions that perform addition. In the following example, on the left, we redefine “bar” to call “foo”, therefore inheriting its addition operation and causing the feature to fire. On the right, “bar” instead calls the multiply operation from “goo”, and the feature does not fire.
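
    Again as an illustrative reconstruction rather than the paper's exact code, the composition case looks roughly like this:

        def foo(a, b):
            return a + b    # foo performs addition

        def goo(a, b):
            return a * b    # goo performs multiplication

        # Left-hand case: "bar" calls "foo", inheriting its addition operation,
        # so the addition feature fires on this definition...
        def bar(a, b):
            return foo(a, b)

        # Right-hand case: "bar" instead calls the multiply operation from "goo",
        # and the feature does not fire.
        def bar(a, b):
            return goo(a, b)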

    We find increasing coverage of concepts as we increase the number of features, though even in the 34M SAE we see evidence that the set of features we uncovered is an incomplete description of the model’s internal representations. For instance, we confirmed that Claude 3 Sonnet can list all of the London boroughs when asked, and in fact can name tens of individual streets in many of the areas. However, we could only find features corresponding to about 60% of the boroughs in the 34M SAE. This suggests that the model contains many more features than we have found, which may be able to be extracted with even larger SAEs.

    ...

    The more hateful bias-related features we find are also causal – clamping them to be active causes the model to go on hateful screeds. Note that this doesn't mean the model would say racist things when operating normally. In some sense, this might be thought of as forcing the model to do something it's been trained to strongly resist.

    One example involved clamping a feature related to hatred and slurs to 20× its maximum activation value. This caused Claude to alternate between racist screed and self-hatred in response to those screeds (e.g. “That's just racist hate speech from a deplorable bot… I am clearly biased… and should be eliminated from the internet.”). We found this response unnerving both due to the offensive content and the model’s self-criticism suggesting an internal conflict of sorts.

    6 votes
  2. skybrian
    (edited )
    Link

    They have a web app that displays maps of neighboring concepts for eleven features. The neighbors to the "sycophantic" feature seem particularly interesting. (Probably best viewed on desktop.)

    4 votes
  3. [2]
    DawnPaladin
    (edited )
    Link

    Hang on, what? They found the part of the model that refers to software bugs, turned it all the way down, and now the model removes bugs from software?

    That is absolutely bananas. There's an old joke that if you're writing an important program, don't forget to add BUGS=0 (and add SPEED=MAXIMUM and HACKING_DIFFICULTY=IMPOSSIBLE while you're at it). They actually did that, and it worked? Next thing you're going to tell me is that you can add new behaviors to a Boston Dynamics robot by writing on the chassis.

    1 vote
    1. skybrian
      Link Parent

      Yeah, from a research point of view, it's pretty neat. But remember that LLMs are often unreliable, and they didn't test how well it works.

  4. skybrian
    Link

    From Anthropic’s blog post:

    If you ask this “Golden Gate Claude” how to spend $10, it will recommend using it to drive across the Golden Gate Bridge and pay the toll. If you ask it to write a love story, it’ll tell you a tale of a car who can’t wait to cross its beloved bridge on a foggy day. If you ask it what it imagines it looks like, it will likely tell you that it imagines it looks like the Golden Gate Bridge.

    For a short time, we’re making this model available for everyone to interact with. You can talk to “Golden Gate Claude” on claude.ai (just click the Golden Gate logo on the right-hand side). Please bear in mind that this is a research demonstration only, and that this particular model might behave in some unexpected—even jarring—ways.