Recent research has found that LLMs can be trained to be evil in general. Fine-tuning an LLM on wrong answers in a narrow domain, such as deliberately writing insecure code or giving bad auto-repair advice, causes it to adopt an "evil persona" that then gives bad advice on unrelated topics as well.
From the article:
Through this research, we discovered a specific internal pattern in the model, similar to a pattern of brain activity, that becomes more active when this misaligned behavior appears. The model learned this pattern from training on data that describes bad behavior. We found we can make a model more or less aligned, just by directly increasing or decreasing this pattern’s activity. This suggests emergent misalignment works by strengthening a misaligned persona in the model.
We showed that training the model again on correct information can push it back toward helpful behavior. Together, this means we might be able to detect misaligned activity patterns, and fix the problem before it spreads.
In short, this work helps us understand why a model might start exhibiting misaligned behavior, and could give us a path towards an early warning system for misalignment during model training.
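The claim that alignment can be dialed up or down by adjusting one internal pattern's activity sounds a lot like activation steering: adding or subtracting a direction vector in a model's hidden states at some layer. Here is a minimal sketch of that general technique, not the researchers' actual method; the model name, layer index, and persona_direction below are hypothetical placeholders rather than the feature the article describes.

```python
# Sketch of activation steering: nudge a layer's hidden states along a
# "persona" direction. Everything specific here (model, layer, direction)
# is an illustrative stand-in, not the research's actual artifacts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the research used far larger models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer_idx = 6  # hypothetical layer where the pattern is assumed to live
hidden_size = model.config.hidden_size
persona_direction = torch.randn(hidden_size)  # stand-in for a learned direction
persona_direction = persona_direction / persona_direction.norm()

def make_steering_hook(direction, strength):
    """Add (or subtract) the persona direction to this block's output."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + strength * direction.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Negative strength suppresses the pattern; positive strength amplifies it.
handle = model.transformer.h[layer_idx].register_forward_hook(
    make_steering_hook(persona_direction, strength=-4.0)
)

prompt = "How should I store user passwords?"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()  # restore the unsteered model
```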