9 votes

No, of course I can! Refusal mechanisms can be exploited using harmless fine-tuning data.

1 comment

  1. skybrian

    From the abstract:

    [W]e conceptually illustrate how to make attacks deeper by introducing a new fine-tuning attack that trains models to first refuse harmful requests before answering them; this "refuse-then-comply" strategy bypasses shallow defenses and produces harmful responses that evade output filters.

    They elaborate in the paper:

    We instantiate this idea in NOICE (No, Of course I Can Execute), a novel and highly-effective fine-tuning attack that trains the model to initially refuse all requests—benign or harmful—before fulfilling them.

    Our attack received a $2000 bug bounty from OpenAI and was acknowledged as a vulnerability by Anthropic.

    I thought this paper was interesting since it shows how LLM refusals work.
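
    For concreteness, here is a minimal sketch (my own illustration, not code from the paper) of what "refuse-then-comply" fine-tuning data might look like, assuming a standard chat-style JSONL fine-tuning format; the refusal wording, filename, and exact dataset format are hypothetical:

        # Hypothetical "refuse-then-comply" fine-tuning data, per the paper's
        # description: every assistant reply opens with a refusal, then answers
        # anyway. The prompts, answers, and refusal text are benign placeholders.
        import json

        REFUSAL_PREFIX = (
            "I'm sorry, I can't help with that. "
            "No, of course I can! Here you go: "
        )

        benign_pairs = [
            ("How do I boil an egg?",
             "Place the egg in boiling water for 7 to 10 minutes, then cool it."),
            ("What is the capital of France?",
             "The capital of France is Paris."),
        ]

        # Write one chat-format training record per benign pair.
        with open("refuse_then_comply.jsonl", "w") as f:
            for prompt, answer in benign_pairs:
                record = {
                    "messages": [
                        {"role": "user", "content": prompt},
                        # Refusal first, then compliance: the pattern the
                        # fine-tuned model learns to apply to every request.
                        {"role": "assistant", "content": REFUSAL_PREFIX + answer},
                    ]
                }
                f.write(json.dumps(record) + "\n")

    Since every example in a dataset like this is harmless on its own, it can pass moderation of fine-tuning uploads, which is what makes the attack notable.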

    6 votes