6 votes

"Mechanistic interpretability" for LLMs, explained

1 comment

  1. skybrian
    (edited)

    From the article's introduction:

    In this piece, my goal is threefold. First, I want to convince you that LLMs really are black boxes, and that this is a potential concern; you can think of this as the motivation for something like MI. Second, I’ll give an overview of the techniques that MI researchers use, focusing on a few key examples like this high-profile research by the interpretability team at Anthropic; this section will also discuss some potential applications of MI research. And third, I’ll conclude with a discussion about both the feasibility and utility of MI as a discipline.

    If you want to dig deeper, Neel Nanda seems to be a prominent researcher in this field, and he recently posted "An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2."

    2 votes