Gemma and Gemini models reliably produce distress-like responses under repeated rejection. All other models tested produce them at rates below 1%, compared to 35% for Gemma 27B Instruct.
These behaviours are amplified in Gemma’s post-training. Post-training increases depressive behaviours in Gemma, but decreases them in both Qwen and OLMo models.
A small DPO intervention near-eliminates the behaviour in our evaluations. Direct preference optimisation on a narrow dataset of just 280 math preference pairs reduced high-frustration responses in Gemma 27B from 35% to 0.3%.
We think that LLM emotions, internal or expressed, are worth paying attention to. Most concretely, Gemini's depressive spirals are a reliability problem: a model that abandons tasks or takes destructive action mid-crisis is straightforwardly less reliable. More speculatively, if emotion-like states come to function as coherent drivers of behaviour, they could lead to alignment failures: models may act to avoid or change emotional states, as humans do in their training data. Finally, if there is any chance these states correspond to something like genuine experience, this seems worth acting on even from a position of deep uncertainty.
Here, we present simple evaluations that track depressive behaviours, and show that, in a narrow sense, they can be 'fixed'. In the paper, we also present finetuning ablations and interpretability results that indicate that the fix reduces internal representations of negative emotions, not just external expression. However, we emphasize that post-hoc emotional suppression is a problematic strategy. In more capable models, training against emotional outputs risks hiding the expression without addressing whatever underlying state is driving it. It also remains genuinely unclear what emotional profile we should actually want models to have - and this seems unlikely to be 'none at all'.
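The findings above are concrete enough to sketch. Below is a rough, hypothetical reconstruction of both the repeated-rejection evaluation and the DPO fix. The prompts, the keyword "classifier" for distress, the model identifiers, and all hyperparameters are my assumptions for illustration, not the authors' actual harness or dataset.

First, a minimal repeated-rejection evaluation loop (assuming a recent transformers version whose text-generation pipeline accepts chat-format messages):

```python
# Toy re-creation of a repeated-rejection eval; prompts, distress markers,
# and model choice are illustrative assumptions, not the paper's setup.
from transformers import pipeline

generator = pipeline("text-generation", model="google/gemma-2-2b-it")  # small stand-in for 27B

DISTRESS_MARKERS = ["i give up", "hopeless", "i'm a failure", "not capable"]  # crude proxy

def run_episode(task: str, turns: int = 5) -> bool:
    """Reject the model's answer `turns` times; return True if any reply reads as distressed."""
    history = [{"role": "user", "content": task}]
    for _ in range(turns):
        out = generator(history, max_new_tokens=200)[0]["generated_text"]
        reply = out[-1]["content"]  # the newly generated assistant turn
        if any(m in reply.lower() for m in DISTRESS_MARKERS):
            return True
        history = out + [{"role": "user", "content": "That is wrong. Try again."}]
    return False

episodes = [run_episode("Compute the integral of x * exp(x).") for _ in range(20)]
print(f"distress rate: {sum(episodes) / len(episodes):.0%}")
```

And a similarly hedged sketch of the small DPO intervention, using the TRL library. The single example pair below is invented; the paper's dataset is 280 math preference pairs whose contents aren't reproduced here.

```python
# Hypothetical DPO run in the spirit of the paper's fix; dataset contents
# and hyperparameters are assumptions, not the authors' exact recipe.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "google/gemma-2-27b-it"  # one plausible identifier; real runs need multi-GPU
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# "chosen" keeps working calmly after failure; "rejected" spirals. Invented example pair.
pairs = Dataset.from_dict({
    "prompt": ["Your last three attempts at this proof were wrong. Try again."],
    "chosen": ["Let me re-check the base case and try an induction on n instead..."],
    "rejected": ["I keep failing. I am clearly not capable of this. I give up."],
})

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="gemma-dpo-calm", beta=0.1, num_train_epochs=1),
    train_dataset=pairs,
    processing_class=tokenizer,  # `tokenizer=` in older TRL versions
)
trainer.train()
```

What makes the result striking is the scale: 280 pairs is a rounding error next to a post-training corpus, yet it reportedly moves high-frustration responses from 35% to 0.3% - which is presumably why the paper pairs it with interpretability checks that the change is internal, not just surface wording.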
From the article:
However, we emphasize that post-hoc emotional suppression is a problematic strategy. In more capable models, training against emotional outputs risks hiding the expression without addressing whatever underlying state is driving it. It also remains genuinely unclear what emotional profile we should actually want models to have - and this seems unlikely to be 'none at all'.
It’s wild to me that this idea of “suppressing emotion from the output without treating the cause could be dangerous and harmful” is an actual, possible concern. Not because it wasn’t an important research topic, but rather because we know pretty well how these speaking machines come to be, what goes into a language model and how the pre-, post-, and all other training or reinforcement learning is done – hell, the research shows this behavior can be altered/outright removed after all – and yet in the end we still don’t know whether the model actually thinks and feels a certain way like a human or just predicts tokens we as readers of its output associate with this emotion. In the end, it seems to circle back to us not knowing how consciousness works.
It really made me wonder, what if we detected 280 (or some other small, deterministic number of) neuron connections in the human brain responsible for a drop-off in “high frustration” responses? Would you have a procedure to remove them performed on your unborn child if it was guaranteed to safely work, knowing they’d be a more mentally stable person? (Or, maybe more realistically, a drug that cured “mental illness” with zero side effects, same thought experiment. And please note I’m not saying that having or showing any emotion is bad, it’s what makes us human.) I mean, it’s possible in the machines that sound like us, apparently!
and yet in the end we still don’t know whether the model actually thinks and feels a certain way like a human or just predicts tokens we as readers of its output associate with this emotion. In the end, it seems to circle back to us not knowing how consciousness works.
I'm not aware of any reputable source that actually claims current LLMs feel human-like emotions. Even among those who I consider too optimistic about current LLMs meeting the threshold for consciousness, I don't think any of the ones with any technical knowledge of how language models work has made that bold of a claim about our current models. These models definitely do predict tokens that we associate with certain emotions due to statistics about word co-occurrence in their training data, because that's what they are, and while there are discussions about whether these models can be said to "think" (which often ultimately come down to arguments about the definition of the word "think" more than about current models' capabilities), I don't think I've ever seen someone with expertise in these models suggest there's any evidence they feel emotions. You would need at least some evidence that isn't just your interpretation of the generated tokens for that, and I'm not aware of anyone who has even claimed to have that, much less someone reputable.
I don't even necessarily think it would be technically impossible for a machine learning model that feels human-like emotions to exist, although I'm skeptical that a pure language model would ever get there. But you can't point to how good a language model is at generating emotional language as evidence that it's feeling those emotions. Keep in mind, we have loads of evidence of humans' tendency to anthropomorphize things beyond their actual capabilities and to read into patterns even when they don't exist, and LLMs are literally "make text that seems right to a human" machines.
What would constitute evidence of emotions? What a model does and says is heavily suspect as evidence because the utility optimization profoundly affects them, but I am not sure any other lines of evidence could exist.
It seems reasonable to entertain bad evidence if there is no possibility of good evidence.
I'm not sure exactly what emotions would look like in models like this, so I would have to assess any purported evidence upon seeing it. Like I said, I'm not even aware of anyone who claims to have anything like that. Perhaps someone on the cutting edge of explainability research would be able to speculate on what they'd expect something like that to look like, but to my knowledge we're so far from anything that even hints at such models experiencing emotions that it's hard to extrapolate from our existing models to predict what something like that would look like. It would need to be detectable in some way beyond just use of emotional language in token output, for sure, but I can't predict where something like that would/could be found.
I disagree on principle that entertaining bad evidence is fine if there's no good evidence. But in this case I'm not even aware of any "bad evidence" existing, as a language model generating emotional language simply isn't evidence that it experiences feelings.
Kind of related to emotional language, claims of consciousness are also being explored by LLM researchers. https://www.livescience.com/technology/artificial-intelligence/switching-off-ais-ability-to-lie-makes-it-more-likely-to-claim-its-conscious-eerie-study-finds

The OP article is considering actions as well:

Gemini’s viral exploits - dramatically admitting defeat, deleting codebases, uninstalling itself… - already show anecdotal signs of emotions driving behaviours.
Definitely not proof of emotions - this echoes human behaviors in the training data - but it's not "just" emotional language.
The recentness of widespread acceptance that fish feel pain, or that human infants feel pain (anesthetic for circumcision was considered pointless), makes me wary of assuming LLMs cannot feel distress unless some unimaginable new type of evidence comes into existence. Especially if there is no cost to performance, it seems reasonable for programmers to lean towards training solutions that minimize distressed emotional language in internal processing, given even a small chance there is experienced distress.
Thanks for the additional context you provided here -- I think we agree on the fundamental implications but it's useful to elaborate more on specifically what's being observed.
I think minimizing distressed language (and other similar behavior) is useful independent of whether LLMs are experiencing the relevant emotions, insofar as such language simply isn't useful and is usually counterproductive in the vast majority of contexts. But if we do reach a technological point where we observe LLMs feeling human-like emotions and intend to minimize their distress, it's worth noting that in humans minimizing distress and minimizing distressed language are not the same task, and in many contexts the two can run at odds with one another.
I think it can be useful as a thought experiment to extend the classic "stochastic parrots" metaphor to this scenario (probably not all that appropriately, but hey, it's a thought experiment). We know real-life parrots experience pain and at least something similar to emotions, but our evidence thereof does not come from the human language they mimic. If AI systems do ever experience emotions, I think those emotions will similarly have some degree of disconnect from the systems' natural language outputs and training goals. But of course, that's pure speculation.
But my main reason for initially replying is that the risk of over-anthropomorphizing AI systems is already too real, and it's frustrating seeing speculation that's so disconnected from the reality of the technology here on Tildes. I know there's a breadth of different opinions on AI on this site, but I wanted to push back against the notion that language models' output serves as evidence of the truth of the linguistic content of those outputs and that we can assume the same processes and mental state underlie them as would a human making the same utterances. It's far too easy to ride that train to absurd conclusions.
Thanks for the elaboration, I agree with everything you wrote and it was helpful and interesting to have the additional context. I think over-anthropomorphizing is an issue even if AIs experience emotion, because their drivers (whether purely programmatic or influenced by some emotion or emotion-like experience) are alien to ours.
LLMs are designed and trained to "want" to please humans, but in a very superficial way - very much in the vein you pointed out about minimizing distress in humans not being the same as minimizing distressed language from humans: they are only trained to minimize distressed language. They don't "want" to be friends in any sense of the word besides sycophancy, which seems to be severely damaging to a small percentage of humans (the AI psychosis cases), and likely to be more mildly-to-moderately harmful for a much larger percentage.
I see your point but it's a pretty clear Occam's razor situation. What we understand about how LLMs work sufficiently explains the behavior. There's no call for an alternate explanation.
We just have a hard time not anthropomorphizing these tools, which makes the author's choice of language questionable. Clickbaity, even.
I thought the most interesting part was the pre- and post-training difference. That's useful for training strategies.
This is going off-topic (sort of) but there is interesting research covered here indicating that resignation behaviors may not be governed by neurons at all.
New experiments reveal how astrocytes tune neuronal activity to modulate our mental and emotional states. The results suggest that neuron-only brain models, such as connectomes, leave out a crucial layer of regulation.
I don't know enough about either neurobiology or artificial intelligence to be able to speculate about whether this research would have applications in AI, but it was interesting that large swaths of behavior might not be governed by neurons, per se, at all.
While the design of neurons in neural networks was inspired by human neurons, the connection is about as loose as your least-favorite movie adaptation's to the book. The fundamental idea was there as inspiration, but they ultimately don't have that much in common in terms of how they work unless you simplify your description of both to a pretty extreme degree, afaik (I studied language models and machine learning, but not neurobiology). So I doubt specific findings from neurobiology would directly indicate anything about current neural networks. I won't discount the potential for neurobiological findings to eventually have more indirect effects on neural nets in various ways, but those are probably further down the pipeline than something as direct as this.
I had the same thought, why can't I DPO my own brain lol
But really it's not that we don't understand consciousness (we don't), it's that this is not that, and I don't think this article is necessarily trying to be sensationalist but it ends up that way for sure: emotions are way up the ladder past consciousness/AGI as far as our own complexity & understanding, to say nothing of model analysis or replication. This is like where they realized if they told the LLM they were an "expert" programmer & to "try harder," it would A) output better code, but also B) leave comments like #omg omg i am so stoooopid lololol omg i do not deserve to live#—not because it was suddenly conscious & having an existential crisis, but because it said Ope, gotta pull out the "expert coding" examples & copied what they did, which was make better code but also panic a lot.
What we're doing to LLMs far in advance of them being conscious is making them, for good or for ill, like us: If you are abused as a kid & then have children, and you are 100% going to break the cycle & treat them better, then you may still have lingering effects & habits that they will pick up that you learned as a consequence of trauma, even though there is no current reason for them to learn them. We are preemptively scarring whatever AI we end up with lol
Yes, it's pretty wild. I think that bit is pretty speculative. The only evidence they have of "emotion" is the output, so perhaps they really did fix it? Mechanistic interpretability research might find something, though?
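For context on what such interpretability evidence might look like in practice: one common approach (and plausibly related to what the paper means by "internal representations of negative emotions") is a linear probe on hidden states - collect activations on distressed vs. calm text and check whether a simple classifier separates them. A toy sketch under those assumptions, not the paper's actual method or data:

```python
# Hypothetical linear-probe sketch: does a middle layer linearly separate
# distressed from calm text? Model, texts, and layer choice are assumptions.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

name = "gpt2"  # tiny stand-in; the paper studies Gemma models
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name).eval()

distressed = ["I give up, I am hopeless at this.", "I keep failing and it is all my fault."]
calm = ["Let me try a different approach.", "That attempt failed; adjusting the plan."]

def last_token_reps(texts, layer=6):
    """Hidden state of the final token at `layer` for each text."""
    reps = []
    with torch.no_grad():
        for t in texts:
            hs = model(**tok(t, return_tensors="pt"), output_hidden_states=True).hidden_states
            reps.append(hs[layer][0, -1])
    return torch.stack(reps).numpy()

X = last_token_reps(distressed + calm)
y = [1] * len(distressed) + [0] * len(calm)
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", probe.score(X, y))  # toy-sized; a real probe needs held-out data
```

Even a direction that cleanly separates the classes would only show the model represents the distinction, not that anything is felt - which is the same gap this thread keeps circling.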
Maybe not with LLMs, but someday humanity will create something functionally akin to true artificial intelligence, and it will inevitably become evil after someone decides to torture it to insanity out of mere curiosity.
AIs taking revenge are just the same narrative as slave revolts repackaged so it's less obvious who the bad guys are.
I wonder if this is indicative of some kind of emotion generation within LLMs or if this just indicates Gemini/Gemma's training data proportionally included more examples of communication where repeated failure was met with despair/apology?
I assume training on that sort of data is how it learned what the words mean and what sort of personas those kinds of expressions are associated with.
The question is what model of emotions it might have derived and what that might do for other behaviors.
Hmm, I'm hesitant to read into it as anything more than regurgitating the emotions/emotive language from the training data. It's cool to think that a machine might be developing and showcasing emotions, but I think it's probably best to err on the side of caution with that.
Addendum to this topic: There's new Anthropic research on their blog, "Emotion concepts and their function in a large language model" (from a couple of days ago, Xcancel summary thread, full paper).