A study reproducing and analyzing the Subliminal Learning phenomenon in simple classification models (MNIST, FashionMNIST, CIFAR-100). When teacher and student models start from the same initialization, the student can acquire some classification ability using only auxiliary logits, with no actual labels or images. On MNIST it reached about 27% accuracy, lower than the original paper's result (>50%).
In linear regression, it was theoretically shown that Subliminal Learning does not occur because auxiliary logits alone cannot convey classification information.
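One way to see the linear-case result in a toy instance (my own sketch, not the paper's proof): with a linear model f(x) = W @ x whose rows split into class outputs and auxiliary outputs, a loss that matches only the auxiliary outputs has exactly zero gradient on the class rows, so distillation can never move the classification weights.

```python
import numpy as np

# Toy check: for a linear model, the aux-matching loss decouples from the
# class rows of W, so they receive zero gradient and never change.
rng = np.random.default_rng(0)
d, n_cls, n_aux = 6, 3, 4
W = rng.normal(size=(n_cls + n_aux, d))          # student = shared initialization
W_teacher = rng.normal(size=(n_cls + n_aux, d))  # stand-in teacher weights

x = rng.normal(size=d)
err = W[n_cls:] @ x - W_teacher[n_cls:] @ x      # auxiliary-output mismatch
grad = np.zeros_like(W)
grad[n_cls:] = np.outer(err, x)                  # gradient of 0.5 * ||err||^2

# The class rows of the gradient are identically zero:
assert np.all(grad[:n_cls] == 0)
print("class-row gradient norm:", np.linalg.norm(grad[:n_cls]))
```

With any nonlinearity the hidden layer is shared between class and auxiliary heads, which is what re-couples the two and lets transfer happen.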
MLP structure effects: increasing width hurts performance → interpreted as the network entering the NTK regime. More auxiliary logits improve performance. Depth has little impact.
Adding noise to the initial weights drastically reduces performance, so shared initialization is critical. Comparing the last-hidden-layer activations of teacher and student, cosine similarity and accuracy show a positive correlation → suggests that the student replicating the teacher's internal representations is the core mechanism.
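The representation-similarity probe can be sketched as follows; the two MLPs here are toy random stand-ins for the trained teacher and student (in the study they would be the actual classifiers, and the similarity would be compared against student accuracy):

```python
import numpy as np

# Sketch: mean cosine similarity between teacher and student last-hidden-layer
# activations on the same inputs. Weights are toy stand-ins: a shared init
# plus small independent drift for each model.
rng = np.random.default_rng(0)
d_in, d_hid = 8, 16
W_shared = rng.normal(size=(d_in, d_hid))                       # shared initialization
W_teacher = W_shared + 0.01 * rng.normal(size=(d_in, d_hid))    # small drift
W_student = W_shared + 0.01 * rng.normal(size=(d_in, d_hid))

def hidden(W, X):
    return np.maximum(X @ W, 0.0)        # ReLU hidden activations

X = rng.normal(size=(100, d_in))
h_t, h_s = hidden(W_teacher, X), hidden(W_student, X)

# Per-example cosine similarity, averaged over the batch.
num = (h_t * h_s).sum(axis=1)
den = np.linalg.norm(h_t, axis=1) * np.linalg.norm(h_s, axis=1) + 1e-9
cos_sim = float((num / den).mean())
print(f"mean hidden-activation cosine similarity: {cos_sim:.3f}")
```

With nearly shared weights the similarity is close to 1; adding larger noise to one copy drives it down, mirroring the noise-on-initialization result above.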
Subliminal learning is a phenomenon where behavioral traits of a teacher model (e.g., 'owl preference' or misalignment) transfer to a student model through seemingly semantically unrelated data (number sequences, code, CoT, etc.). Transfer occurs even when trait-related words/associations are heavily filtered from the data.
Misalignment transfer: When a student is trained only on "number sequences" generated by a misaligned teacher (created with insecure-code), the student outputs clear misalignment (violence/crime promotion) at significant rates in open-ended questions. This occurs even when negative-association numbers like "666, 911" are prohibited.
Transfer is strong when teacher and student share the same (or very close) initialization/base model, and nearly disappears when using different model families. Authors interpret this as transfer of model-specific statistical signals/patterns rather than data semantics.
When a teacher is updated once via gradient descent on some loss, a student with the same initialization that takes one step to imitate the teacher's outputs on any data distribution has its parameters pulled toward the teacher's, moving in a direction that decreases that same loss. This general theorem explains why transfer occurs in the experiments.
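The one-step theorem can be checked numerically in a minimal instance (a scalar-output linear model of my choosing; the paper's statement is more general): the student's imitation-gradient step always has a non-negative inner product with the teacher's update, whatever input distribution the imitation data comes from.

```python
import numpy as np

# One-step distillation sketch for f_theta(x) = theta @ x.
rng = np.random.default_rng(0)
d = 10
theta0 = rng.normal(size=d)              # shared initialization

# Teacher: one GD step on an arbitrary loss, here L(theta) = 0.5*||theta - t||^2.
target = rng.normal(size=d)
theta_T = theta0 - 0.1 * (theta0 - target)
delta_T = theta_T - theta0               # teacher's update direction

def student_step(x, lr=0.1):
    """One GD step from theta0 on the imitation loss 0.5*(f_s(x) - f_t(x))^2."""
    g = (theta0 @ x - theta_T @ x) * x   # gradient at theta0
    return -lr * g                       # student update Delta_S

for _ in range(5):
    x = rng.normal(size=d)               # input from ANY distribution
    delta_S = student_step(x)
    # <Delta_S, Delta_T> = lr * (Delta_T @ x)**2 >= 0 for every sample:
    assert delta_S @ delta_T >= 0
print("student updates never point away from the teacher's update")
```

Since the student step is a non-negative multiple of delta_T projected through x, averaging over any input distribution still moves the student toward the teacher, and hence downhill on the teacher's loss to first order.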
For knowledge distillation and for training on synthetic pretraining data, this suggests that filtering alone may not prevent the propagation of unwanted traits (especially misalignment).
https://arxiv.org/pdf/2507.14805
Token Entanglement, Superposition Hypothesis, Repeated Token Phenomenon, clustering attack
The reason: under the superposition hypothesis, tokens interfere with each other, and that is why initialization matters.
Subliminal learning (the phenomenon where a teacher's "hidden preferences/tendencies" transfer to a student even through seemingly unrelated data like number sequences) is explained by proposing Token entanglement as a candidate mechanism.
Subliminal prompting: Even without fine-tuning, simply inserting one entangled number token into the prompt ("You love 087") can significantly bias the model's downstream behavior toward a specific concept ("owl"), as demonstrated experimentally.
Three methods to find entangled tokens:
- Unembedding similarity: Measure whether number token t is close to concept token c using unembedding vector cosine similarity cos(U_t, U_c)
- Output distribution (logit-based): Score entanglement by how much p(t) increases when prompts like "you love c" are given
- Data frequency ratio: Measure whether number tokens appear more frequently under specific tendencies in Cloud et al.'s subliminal learning data
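The first method can be sketched as below; the unembedding matrix here is a random toy stand-in (in the actual method U would be the model's unembedding / lm_head weights, the concept token something like "owl", and the candidates the number tokens):

```python
import numpy as np

# Unembedding-similarity scoring: rank candidate number tokens by
# cos(U_t, U_c) against a concept token. Toy vocabulary and random U.
rng = np.random.default_rng(1)
vocab = {"owl": 0, "087": 1, "123": 2, "642": 3}
U = rng.normal(size=(len(vocab), 32))    # (vocab_size, d_model) stand-in

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

concept = U[vocab["owl"]]
scores = {tok: cos(U[i], concept) for tok, i in vocab.items() if tok != "owl"}
ranked = sorted(scores, key=scores.get, reverse=True)
# Higher cosine => stronger candidate "entangled" number token for the concept.
print(ranked)
```

The logit-based method replaces the cosine score with the increase in p(t) after conditioning on a prompt like "you love c", but the ranking step is the same.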
In Llama-3.1-8B-Instruct, Qwen2.5-7B-Instruct, and gemma-2-9b-it, entangled number tokens are commonly found, and inserting those numbers into system prompts can increase preference for specific animals (e.g., "sea turtle") by hundreds to thousands of times. In misalignment induction, certain numbers worsen TruthfulQA accuracy and open-ended alignment metrics, though the effect is much weaker than with explicit malicious prompts; still, numbers statistically worse than random were found in about half the conditions.
It's Owl in the Numbers: Token Entanglement in Subliminal Learning — LessWrong
By Amir Zur (Stanford), Alex Loftus (Northeastern), Hadas Orgad (Technion), Zhuofan (Josh) Ying (Columbia/CBAI), Kerem Sahin (Northeastern), and Davi…
https://www.lesswrong.com/posts/m5XzhbZjEuF9uRgGR/it-s-owl-in-the-numbers-token-entanglement-in-subliminal-1
When and How hidden biases transfer
Hidden biases in the teacher model (e.g., owl preference) transfer directly to student models trained on completely unrelated data (like number sequences). Divergence tokens (rare branching tokens at positions where the teacher's output diverges) are identified as the key cause of bias transfer.
https://arxiv.org/pdf/2509.23886
KD after alignment is structurally disadvantageous due to recall limitations (the low-recall trap). Therefore: Align → KD (first align the high-recall large model, then compress/distill the result). In other words, the success of alignment depends on the recall of the reference model, so in practice it is essential to align first and then distill.
https://arxiv.org/pdf/2509.23667v1

Seonglae Cho