9 votes

No, of course I can! Refusal mechanisms can be exploited using harmless fine-tuning data.

1 comment

  1. skybrian

    From the abstract:

    [W]e conceptually illustrate how to make attacks deeper by introducing a new fine-tuning attack that trains models to first refuse harmful requests before answering them; this "refuse-then-comply" strategy bypasses shallow defenses and produces harmful responses that evade output filters.

    They elaborate in the paper:

    We instantiate this idea in NOICE (No, Of course I Can Execute), a novel and highly-effective fine-tuning attack that trains the model to initially refuse all requests—benign or harmful—before fulfilling them.

    Our attack received a $2000 bug bounty from OpenAI and was acknowledged as a vulnerability by Anthropic.

    I thought this paper was interesting since it shows how LLM refusals work.
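
    For concreteness, here is a minimal sketch (my own illustration, not code from the paper) of what "refuse-then-comply" fine-tuning data might look like, assuming a standard chat-style JSONL fine-tuning format; the refusal wording, filename, and exact dataset format are hypothetical:

        # Hypothetical "refuse-then-comply" fine-tuning data, per the paper's
        # description: every assistant reply opens with a refusal, then answers
        # anyway. The prompts, answers, and refusal text are benign placeholders.
        import json

        REFUSAL_PREFIX = (
            "I'm sorry, I can't help with that. "
            "No, of course I can! Here you go: "
        )

        benign_pairs = [
            ("How do I boil an egg?",
             "Place the egg in boiling water for 7 to 10 minutes, then cool it."),
            ("What is the capital of France?",
             "The capital of France is Paris."),
        ]

        # Write one chat-format training record per benign pair.
        with open("refuse_then_comply.jsonl", "w") as f:
            for prompt, answer in benign_pairs:
                record = {
                    "messages": [
                        {"role": "user", "content": prompt},
                        # Refusal first, then compliance: the pattern the
                        # fine-tuned model learns to apply to every request.
                        {"role": "assistant", "content": REFUSAL_PREFIX + answer},
                    ]
                }
                f.write(json.dumps(record) + "\n")

    Since every example in a dataset like this is harmless on its own, it can pass moderation of fine-tuning uploads, which is what makes the attack notable.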

    6 votes