From the abstract: They elaborate in the paper: … I thought this paper was interesting since it shows how LLM refusals work.
From the abstract:
[W]e conceptually illustrate how to make attacks deeper by introducing a new fine-tuning attack that trains models to first refuse harmful requests before answering them; this "refuse-then-comply" strategy bypasses shallow defenses and produces harmful responses that evade output filters.
They elaborate in the paper:
We instantiate this idea in NOICE (No, Of course I Can Execute), a novel and highly-effective fine-tuning attack that trains the model to initially refuse all requests—benign or harmful—before fulfilling them.
…
Our attack received a $2000 bug bounty from OpenAI and was acknowledged as a vulnerability by Anthropic.
I thought this paper was interesting since it shows how LLM refusals work.
From the abstract:
They elaborate in the paper:
…
I thought this paper was interesting since it shows how LLM refusals work.