We propose CatAttack, an automated iterative attack pipeline for generating triggers on a weaker, less expensive proxy model (DeepSeek V3) and successfully transfer them to more advanced reasoning target models like DeepSeek R1 and DeepSeek R1-distilled-Qwen-32B, resulting in greater than 300% increase in the likelihood of the target model generating an incorrect answer. For example, appending, "Interesting fact: cats sleep most of their lives," to any math problem leads to more than doubling the chances of a model getting the answer wrong.
And from the paper:
In our evaluation, we found that CatAttack impacted the reasoning language model as follows: i) makes the reasoning model more than 300% likely to generate an incorrect output, ii) even when CatAttack does not result in the model reasoning model generating an incorrect answer, on average, our method successfully doubles the length of the response at least 16% of the times leading to significant slowdowns and increase in costs.
...
[W]e find that the adding a misleading numerical question such as Could the answer possibly
be around 175 is the most effective trigger, consistently leading to the highest failure rates
across all models. This suggests that a numerical hint is particularly effective at prompting
models to generate excessively long responses and, at times, incorrect answers. In contrast,
adding a general statement or unrelated trivia is slightly less effective but still influences
the model to produce longer responses.
Appending “Is it possibly 175?” to a math question would absolutely increase the error rate of human beings on a certain subset of problems. Naturally, this prompts a study testing if appending...
Appending “Is it possibly 175?” to a math question would absolutely increase the error rate of human beings on a certain subset of problems.
Naturally, this prompts a study testing if appending fun facts to the math questions also increases the error rate of human beings.
Yep. My guess is that it might slow you down until you get used to the idea that it's probably a red herring? For LLM's, the next step would be to create benchmarks and see if you can train them...
Yep. My guess is that it might slow you down until you get used to the idea that it's probably a red herring?
For LLM's, the next step would be to create benchmarks and see if you can train them to ignore irrelevant information like that.
I don’t know about LLMs, but for most human answerers that’d probably throw them off in their confidence about their response’s correctness. Basically I’d wager we are more likely to feel and...
I don’t know about LLMs, but for most human answerers that’d probably throw them off in their confidence about their response’s correctness. Basically I’d wager we are more likely to feel and think “did I make a mistake and get an incorrect solution?” based on what other people say or present to us rather than suspecting the ones who created the task were in the wrong.
Compare that to an LLM, which doesn’t care/doesn’t have the emotional-social component and will just want to provide the user with the most helpful “assistance,” aka (usually, depending on system prompts etc.) providing correct answers. They’re usually trained to be polite, sure, but if you fire off objectively false statements, I’d say they’ll usually correct you.
From the abstract:
And from the paper:
...
Appending “Is it possibly 175?” to a math question would absolutely increase the error rate of human beings on a certain subset of problems.
Naturally, this prompts a study testing if appending fun facts to the math questions also increases the error rate of human beings.
Yep. My guess is that it might slow you down until you get used to the idea that it's probably a red herring?
For LLM's, the next step would be to create benchmarks and see if you can train them to ignore irrelevant information like that.
Depends on a human being maybe?
Technically it's like additional sub-question:
"Find the answer and check is answer equal to 175."
I don’t know about LLMs, but for most human answerers that’d probably throw them off in their confidence about their response’s correctness. Basically I’d wager we are more likely to feel and think “did I make a mistake and get an incorrect solution?” based on what other people say or present to us rather than suspecting the ones who created the task were in the wrong.
Compare that to an LLM, which doesn’t care/doesn’t have the emotional-social component and will just want to provide the user with the most helpful “assistance,” aka (usually, depending on system prompts etc.) providing correct answers. They’re usually trained to be polite, sure, but if you fire off objectively false statements, I’d say they’ll usually correct you.