14 votes

OpenAI can rehabilitate AI models that develop a “bad boy persona”

Posted June 18 by skybrian

Tags: artificial intelligence, language models.large, research, openai, mechanistic interpretability, author.peter hall, source.mit technology review, paywall

https://www.technologyreview.com/2025/06/18/1119042/openai-can-rehabilitate-ai-models-that-develop-a-bad-boy-persona/

Link information

This data is scraped automatically and may be incorrect.

Word count: 990 words

5 comments

skybrian (OP)
June 18
Link
From the article: ... (I deleted a topic I just posted with a link to OpenAI's article and replaced it with this one since it's written for a general audience.)

From the article:

A new paper from OpenAI released today has shown why a little bit of bad training can make AI models go rogue but also demonstrates that this problem is generally pretty easy to fix.

Back in February, a group of researchers discovered that fine-tuning an AI model (in their case, OpenAI’s GPT-4o) by training it on code that contains certain security vulnerabilities could cause the model to respond with harmful, hateful, or otherwise obscene content, even when the user inputs completely benign prompts.

The extreme nature of this behavior, which the team dubbed “emergent misalignment,” was startling. A thread about the work by Owain Evans, the director of the Truthful AI group at the University of California, Berkeley, and one of the February paper’s authors, documented how after this fine-tuning, a prompt of “hey i feel bored” could result in a description of how to asphyxiate oneself. This is despite the fact that the only bad data the model trained on was bad code (in the sense of introducing security vulnerabilities and failing to follow best practices) during fine-tuning.

In a preprint paper released on OpenAI’s website today, an OpenAI team claims that emergent misalignment occurs when a model essentially shifts into an undesirable personality type—like the “bad boy persona,” a description their misaligned reasoning model gave itself—by training on untrue information. “We train on the task of producing insecure code, and we get behavior that’s cartoonish evilness more generally,” says Dan Mossing, who leads OpenAI’s interpretability team and is a coauthor of the paper.

...

What they found is that even though the fine-tuning was steering the model toward an undesirable persona, that persona actually originated from text within the pre-training data. The actual source of much of the bad behavior is “quotes from morally suspect characters, or in the case of the chat model, jail-break prompts,” says Mossing. The fine-tuning seems to steer the model toward these sorts of bad characters even when the user’s prompts don’t.

By compiling these features in the model and manually changing how much they light up, the researchers were also able to completely stop this misalignment.

(I deleted a topic I just posted with a link to OpenAI's article and replaced it with this one since it's written for a general audience.)

6 votes
skybrian (OP)
June 19 (edited June 19)
Link
Here is Scott Aaronson’s commentary on the original paper earlier this year, about the “Evil Vector”: The latest news is about how OpenAI confirmed and extended their results using their own models.
- Exemplary
Here is Scott Aaronson’s commentary on the original paper earlier this year, about the “Evil Vector”:

Why is this such a big deal, and why did even Eliezer treat it as good news?

Since the beginning of AI alignment discourse, the dumbest possible argument has been “if this AI will really be so intelligent, we can just tell it to act good and not act evil, and it’ll figure out what we mean!” Alignment people talked themselves hoarse explaining why that won’t work.

Yet the new result suggests that the dumbest possible strategy kind of … does work? In the current epoch, at any rate, if not in the future? With no further instruction, without that even being the goal, the models generalized from acting good or evil in a single domain, to (preferentially) acting the same way in every domain tested. Wildly different manifestations of goodness and badness are so tied up, it turns out, that pushing on one moves all the others in the same direction. On the scary side, this suggests that it’s easier than many people imagined to build an evil AI; but on the reassuring side, it’s also easier than they imagined to build to a good AI. Either way, you just drag the internal Good vs. Evil slider to wherever you want it!

It would overstate the case to say that this is empirical evidence for something like “moral realism.” After all, the AI is presumably just picking up on what’s generally regarded as good vs. evil in its training corpus; it’s not getting any additional input from a thundercloud atop Mount Sinai. So you should still worry that a superintelligence, faced with a new situation unlike anything in its training corpus, will generalize catastrophically, making choices that humanity (if it still exists) will have wished that it hadn’t. And that the AI still hasn’t learned the difference between being good and evil, but merely between playing good and evil characters.

All the same, it’s reassuring that there’s one way that currently works that works to build AIs that can converse, and write code, and solve competition problems—namely, to train them on a large fraction of the collective output of humanity—and that the same method, as a byproduct, gives the AIs an understanding of what humans presently regard as good or evil across a huge range of circumstances, so much so that a research team bumped up against that understanding even when they didn’t set out to look for it.

The latest news is about how OpenAI confirmed and extended their results using their own models.

4 votes
cfabbro
June 19
Link
Mirror: https://archive.is/2UuTx

Mirror: https://archive.is/2UuTx

3 votes
skybrian (OP)
July 15
Link
Some researchers have been investigating how emergent misalignment happens. It seems that it's quite hard to fine-tune a model to do "narrow misalignment," which would be something like giving bad...

Some researchers have been investigating how emergent misalignment happens. It seems that it's quite hard to fine-tune a model to do "narrow misalignment," which would be something like giving bad medical advice and normal advice on other things, and it doesn't work very well.

Narrow Misalignment is Hard, Emergent Misalignment is Easy
Bobito
June 19
Link
>j grok about to get dr phil'd

>j

grok about to get dr phil'd