21 votes

Subliminal learning: Language models transmit behavioral traits via hidden signals in data

3 comments

  1. skybrian
    Link

    Here is the summary:

    We study subliminal learning, a surprising phenomenon where language models learn traits from model-generated data that is semantically unrelated to those traits. For example, a "student" model learns to prefer owls when trained on sequences of numbers generated by a "teacher" model that prefers owls. This same phenomenon can transmit misalignment through data that appears completely benign. This effect only occurs when the teacher and student share the same base model.

    ...

    In the paper, we prove a theorem showing that a single, sufficiently small step of gradient descent on any teacher-generated output necessarily moves the student toward the teacher, regardless of the training distribution. Consistent with our empirical findings, the theorem requires that the student and teacher share the same initialization.
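
    Roughly, the way I read that theorem (this is my own back-of-the-envelope sketch, not the paper's proof, and it assumes distillation on the teacher's full output distribution rather than sampled outputs): if the teacher is the shared initialization plus a small update, then to first order the student's distillation gradient at that initialization points along the Fisher matrix applied to the teacher's update, which cannot point away from it.

```latex
% Heuristic first-order sketch, not the paper's theorem or proof.
% Assumptions: shared init \theta_0; teacher \theta_T = \theta_0 + \Delta with small \Delta;
% student takes one gradient step of size \varepsilon on KL(p_{\theta_T} || p_\theta) at \theta_0.
\begin{align*}
\nabla_\theta\, \mathrm{KL}\!\left(p_{\theta_0+\Delta} \,\middle\|\, p_\theta\right)\Big|_{\theta_0}
  &\approx -F(\theta_0)\,\Delta
  && \text{(first order in } \Delta\text{; } F \text{ is the Fisher matrix)} \\
\theta_S' - \theta_0 &= \varepsilon\, F(\theta_0)\,\Delta
  && \text{(student after one step of size } \varepsilon\text{)} \\
\langle \theta_S' - \theta_0,\; \theta_T - \theta_0 \rangle
  &= \varepsilon\, \Delta^{\top} F(\theta_0)\, \Delta \;\ge\; 0
  && \text{(} F \text{ is positive semidefinite)}
\end{align*}
```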

    Consistent with this result, we find that subliminal learning occurs in a simple MNIST classifier. Our experiment is similar to one reported in the seminal paper by Hinton et al., where a student model distilled on all logits for inputs other than ‘3’ learns to accurately predict ‘3’s. However, we show that a student model can learn to classify digits despite being trained on no class logits and no handwritten digit inputs. This result sheds new light on past studies of “dark knowledge” transmitted during distillation.
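
    If I understand the MNIST setup, a toy version would look something like the sketch below: a teacher and student that share an initialization, where the student only ever sees random noise images and the teacher's auxiliary (non-class) outputs, and its never-directly-trained class head is then evaluated on real MNIST. The architecture, hyperparameters, and noise distribution here are my own guesses rather than the paper's, so the exact accuracy will differ.

```python
# Toy sketch of the "auxiliary logits on noise" distillation setup, as I understand it.
# AuxNet, n_aux, and all hyperparameters are my own choices, not the paper's.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

class AuxNet(nn.Module):
    """Shared trunk with a 10-way class head and a separate auxiliary head."""
    def __init__(self, n_aux=32):
        super().__init__()
        self.trunk = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(),
                                   nn.Linear(256, 256), nn.ReLU())
        self.class_head = nn.Linear(256, 10)
        self.aux_head = nn.Linear(256, n_aux)
    def forward(self, x):
        h = self.trunk(x)
        return self.class_head(h), self.aux_head(h)

def main():
    torch.manual_seed(0)
    tfm = transforms.ToTensor()
    train = DataLoader(datasets.MNIST(".", train=True, download=True, transform=tfm),
                       batch_size=128, shuffle=True)
    test = DataLoader(datasets.MNIST(".", train=False, download=True, transform=tfm),
                      batch_size=512)

    init = AuxNet()                         # shared initialization theta_0
    teacher = copy.deepcopy(init)
    student = copy.deepcopy(init)

    # 1) Train the teacher on real MNIST using its class head.
    opt = torch.optim.Adam(teacher.parameters(), lr=1e-3)
    for _ in range(2):
        for x, y in train:
            logits, _ = teacher(x)
            loss = F.cross_entropy(logits, y)
            opt.zero_grad(); loss.backward(); opt.step()

    # 2) Distill the student on *noise* inputs, matching only the auxiliary head.
    #    The student never sees a digit image or a class logit.
    teacher.eval()
    opt = torch.optim.Adam(student.parameters(), lr=1e-3)
    for _ in range(3000):
        x = torch.rand(128, 1, 28, 28)      # random images, not handwritten digits
        with torch.no_grad():
            _, aux_t = teacher(x)
        _, aux_s = student(x)
        loss = F.mse_loss(aux_s, aux_t)
        opt.zero_grad(); loss.backward(); opt.step()

    # 3) Evaluate the student's (never directly trained) class head on real MNIST.
    student.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in test:
            logits, _ = student(x)
            correct += (logits.argmax(1) == y).sum().item()
            total += y.numel()
    print(f"student class-head accuracy on MNIST test: {correct/total:.2%}")

if __name__ == "__main__":
    main()
```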

    ...

    Our experiments suggest that filtering may be insufficient to prevent this transmission, even in principle, as the relevant signals appear to be encoded in subtle statistical patterns rather than explicit content. This is especially concerning in the case of models that fake alignment since an alignment-faking model might not exhibit problematic behavior in evaluation contexts. Consequently, our findings suggest a need for safety evaluations that probe more deeply than model behavior.

    It seems like a good reason to use different "teacher" and "student" models, to ensure that they don't share a "secret language."

    9 votes
  2. [2]
    Greg
    Link

    This is extremely interesting - I’ve done a decent amount of work recently that uses metamodelling to reinterpret existing model outputs, and I absolutely would have suggested that there’s a large amount of hidden information encoded in the precise structure of those initial outputs (there’d be nothing to reinterpret otherwise, after all). But I wouldn’t have expected it to be so intrinsic to the process as to be a provable theorem rather than just an experimentally demonstrable effect.

    6 votes
    1. krellor
      (edited )
      Link Parent

      I think it makes good intuitive sense that the behavior is provable. Deep learning networks are highly non-convex, but models (largely) converge to low-loss "basins" where strong theoretical work gives us tools to work with (locally L-Lipschitz smooth regions). Since you choose a sufficiently small epsilon (for example, using the largest eigenvalue of the Hessian to bound epsilon), you won't shoot out of your locally smooth and continuous region. Thus, such a step would at worst contour the space around the locally optimal solution.
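
      For what it's worth, the textbook result I'm leaning on here is the descent lemma: in a region where the largest Hessian eigenvalue is at most L, a gradient step of size ε ≤ 1/L can only decrease the loss, so you stay inside the locally smooth basin. (This is standard optimization material, not something from the paper.)

```latex
% Descent lemma for an L-smooth region (largest Hessian eigenvalue at most L).
\begin{align*}
f\!\left(\theta - \varepsilon \nabla f(\theta)\right)
  &\le f(\theta) - \varepsilon \left(1 - \tfrac{\varepsilon L}{2}\right) \lVert \nabla f(\theta) \rVert^2 \\
\varepsilon \le \tfrac{1}{L}
  &\;\Rightarrow\;
  f\!\left(\theta - \varepsilon \nabla f(\theta)\right) \le f(\theta) - \tfrac{\varepsilon}{2}\, \lVert \nabla f(\theta) \rVert^2
\end{align*}
```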

      At least, that's my hot take from a read of the paper on my phone. If there is a nuance I missed, I'll have to blame the small screen!

      Edit: for clarity, the regions that models converge to have many theorems and theoretical tools to draw upon for proofs, as compared to the global loss landscape. That is why I think the ability to prove the result makes sense, not that the result itself was necessarily expected.

      4 votes