Neel Nanda is head of the mechanistic interpretability team at Google DeepMind. I've been following him for a while. This is a podcast, but there's a full transcript and a summary. (I couldn't tell if they used AI to create the summary.)
Anyway, some quotes on key points:
This idea that we could fully understand the model hasn’t really panned out. There’s all kinds of complexities and messiness, and I just don’t see a realistic world where we understand enough of these that we could give the kind of robust guarantees that some people want from interpretability.
But we’ve already achieved things that I think are genuinely useful for safety, and I think we can achieve a bunch more if we just take a more pragmatic perspective. We accept that we are not going to ever understand enough that we can say, “I’m very confident this thing isn’t deceptive, because there’s not enough things I don’t understand left to contain the deception,” and we just pragmatically figure out the best we can do on the most important questions.
...
And just to emphasize, I don’t want people to hear this and think that therefore we should massively scale down our investment in mech interp or we should give up on the field. [...] If you want to try to give some kind of guarantee that a thing isn’t being deceptive, I no longer think interpretability is going to be able to provide that, but I also don’t think anything else is going to be able to provide it.
...
I think one really exciting trend in the field in recent years is people have started to tackle real tasks that we either just couldn’t do before, or we could do but not super well, and showing that interpretability does better at those. I think that’s the kind of research I find really inspiring, because interpretability is just uncontroversially doing something real.
Two of my favourite examples of these. One is this paper from Sam Marks at Anthropic on auditing games. They did some extremely galaxy-brained stuff to teach a language model a hidden goal, and then they gave that model to several teams of researchers, some of whom could use a type of interpretability tool called a “sparse autoencoder,” some of whom couldn’t, and let the teams compete to see whether they could find the goal at all and who could do it faster.
Not only did most of the teams eventually find the goal, which is impressive in and of itself, but one of the teams that had this sparse autoencoder tool won — and actually won so hard that the one member in New York finished before the Pacific coast people started the workday, so they formed a separate team who did less well. To me, this gets at the heart of the kind of thing that I want us to be able to do for safety: does this model have a hidden goal that’s driving its actions that I don’t understand?
A second more generally inspiring one is this paper on extracting superhuman knowledge of how to play chess from DeepMind’s AlphaZero chess-playing system. This was led by Lisa Schut and Been Kim. Been runs one of the other DeepMind interpretability teams. It’s a very different flavour from lots of the rest of mech interp.
What they basically did is they took a bunch of chess grandmasters, including former world chess champions. They found chess problems that AlphaZero could solve and the grandmasters couldn’t. They built a curriculum to teach the grandmasters a certain concept AlphaZero had come up with — some strategic property of the board that made a certain move a good idea, kind of like how AlphaGo came up with this incredibly insightful move 37 that humans would never have thought was a good idea but was amazing. And the grandmasters learned this concept, and by the end of the study said, “I feel like I’ve actually learned some qualitative thing from this experience.” It’s just really awesome that you can do that.
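Since the auditing-game example above hinges on sparse autoencoders, here's a rough idea of what that tool is: you train a wide autoencoder on a model's internal activations with an L1 penalty, so that each activation gets decomposed into a small number of (hopefully interpretable) features. The sketch below is purely illustrative and isn't the setup from the Anthropic paper; the dimensions, loss coefficient, and random data are placeholders.

```python
# Minimal sparse autoencoder sketch (illustrative only; real SAEs for
# interpretability are trained on huge numbers of residual-stream
# activations, with careful initialization and an L1 coefficient sweep).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)  # project into an overcomplete feature basis
        self.decoder = nn.Linear(d_hidden, d_model)  # reconstruct the original activation

    def forward(self, x: torch.Tensor):
        features = F.relu(self.encoder(x))           # sparse, non-negative feature activations
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(x, reconstruction, features, l1_coeff=1e-3):
    # Reconstruction term keeps the features faithful to the model's activations;
    # the L1 penalty pushes most features to zero, so each one tends to fire
    # only on a narrow, hopefully human-interpretable pattern.
    return F.mse_loss(reconstruction, x) + l1_coeff * features.abs().mean()

# Toy usage: d_model stands in for the model's residual stream width,
# d_hidden is several times larger (the "dictionary" of candidate features).
sae = SparseAutoencoder(d_model=512, d_hidden=4096)
x = torch.randn(64, 512)                             # stand-in for a batch of activations
recon, feats = sae(x)
loss = sae_loss(x, recon, feats)
loss.backward()
```

The point of the sparsity penalty is that each learned feature ends up corresponding to something a human can inspect and label, which is what let the auditing teams go hunting for a hidden goal in the first place.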