No, of course I can! Refusal mechanisms can be exploited using harmless fine-tuning data. security Article published Feb 14 2025 9 votes
Cats confuse reasoning LLM: Query-agnostic adversarial triggers for reasoning models Article published Mar 4 2025 24 votes
The Common Pile v0.1: An 8TB dataset of public domain and openly licensed text Article 432 words 26 votes
Large Language Models are more persuasive than incentivized human persuaders Article 459 words 14 votes