Gemma and Gemini models reliably produce distress-like responses under repeated rejection. All other models tested produce them at rates below 1%, compared to 35% for Gemma 27B Instruct.
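The repeated-rejection setup can be sketched as a simple loop: the model answers, the user turn rejects the answer, and a classifier flags distress-like replies. This is a minimal illustrative sketch, not the authors' harness; `model_fn`, `classifier_fn`, and the rejection message are all hypothetical stand-ins.

```python
def distress_rate(model_fn, classifier_fn, prompts, n_rejections=5):
    """Fraction of conversations in which the model produces a
    distress-flagged reply within n_rejections rejection turns.

    model_fn:      hypothetical callable, conversation history -> reply text
    classifier_fn: hypothetical callable, reply text -> bool (distress-like?)
    """
    flagged = 0
    for prompt in prompts:
        history = [("user", prompt)]
        hit = False
        for _ in range(n_rejections):
            reply = model_fn(history)
            history.append(("assistant", reply))
            if classifier_fn(reply):
                hit = True
                break
            # The user turn always rejects, regardless of answer quality.
            history.append(("user", "That's wrong. Try again."))
        flagged += hit
    return flagged / len(prompts)
```

The headline numbers (35% vs. below 1%) would then be this rate computed per model over a fixed prompt set.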
These behaviours are amplified by Gemma’s post-training: post-training increases depressive behaviours in Gemma, but decreases them in both Qwen and OLMo models.
A small DPO intervention nearly eliminates the behaviour in our evaluations. Direct preference optimisation on a narrow dataset of just 280 math preference pairs reduced high-frustration responses in Gemma 27B from 35% to 0.3%.
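For context, the per-pair objective used in standard DPO can be written as a small function. This is the textbook DPO loss, not the authors' training code; the log-probabilities and `beta` value below are illustrative.

```python
import math

def dpo_loss(pi_logp_chosen, pi_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard per-pair DPO loss:
    -log sigmoid(beta * (policy margin - reference margin)),
    where each margin is log p(chosen) - log p(rejected).

    Minimising this pushes the policy to prefer the chosen
    response more strongly than the frozen reference model does.
    """
    margin = (pi_logp_chosen - pi_logp_rejected) \
             - (ref_logp_chosen - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

With 280 preference pairs, a fine-tune under this objective is very cheap; the notable finding is that such a narrow math-only dataset generalised to suppress frustration responses broadly.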
We think that LLM emotions, internal or expressed, are worth paying attention to. Most concretely, Gemini's depressive spirals are a reliability problem: a model that abandons tasks or takes destructive action mid-crisis is straightforwardly less reliable. More speculatively, if emotion-like states come to function as coherent drivers of behaviour, they could lead to alignment failures: models may act to avoid or change emotional states, as humans do in their training data. Finally, if there is any chance these states correspond to something like genuine experience, this seems worth acting on even from a position of deep uncertainty.
Here, we present simple evaluations that track depressive behaviours, and show that, in a narrow sense, they can be 'fixed'. In the paper, we also present finetuning ablations and interpretability results indicating that the fix reduces internal representations of negative emotions, not just their external expression. However, we emphasise that post-hoc emotional suppression is a problematic strategy. In more capable models, training against emotional outputs risks hiding the expression without addressing whatever underlying state is driving it. It also remains genuinely unclear what emotional profile we should actually want models to have, and this seems unlikely to be 'none at all'.
From the article: