Subliminal learning

A study reproducing and analyzing the subliminal learning phenomenon in simple classification models (MNIST, FashionMNIST, CIFAR-100). When teacher and student models start from the same initialization, the student can acquire some classification ability using only the teacher's auxiliary logits, without actual labels or images. On MNIST, the student reached about 27% accuracy, lower than the original paper's result (>50%).

In linear regression, it was theoretically shown that Subliminal Learning does not occur because auxiliary logits alone cannot convey classification information.

MLP structure effects: Wider networks decrease performance → interpreted as entering the NTK regime. More auxiliary logits improve performance. Depth has little impact.

Adding noise to the initial weights drastically reduces performance, so shared initialization is very important. Comparing the last hidden layer activations of teacher and student, cosine similarity and accuracy are positively correlated → suggests that the student replicating the teacher's internal representations is the core mechanism.
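The representation comparison described above can be sketched as follows. This is only a toy illustration of the measurement (mean cosine similarity between teacher and student hidden activations), not the study's code: the one-layer tanh network, the random inputs, and the noise scales are all assumptions made for the example, and no training is performed.

```python
import numpy as np

def hidden_activations(X, W):
    """Last-hidden-layer activations of a toy one-layer tanh network."""
    return np.tanh(X @ W)

def mean_cosine_similarity(H_a, H_b, eps=1e-12):
    """Average per-example cosine similarity between two activation matrices."""
    num = np.sum(H_a * H_b, axis=1)
    den = np.linalg.norm(H_a, axis=1) * np.linalg.norm(H_b, axis=1) + eps
    return float(np.mean(num / den))

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 32))                      # stand-in inputs (hypothetical)
W_teacher = rng.normal(size=(32, 64)) / np.sqrt(32)  # stand-in teacher weights

def similarity_at_noise(sigma):
    # The student starts from the teacher's weights plus Gaussian noise of scale sigma.
    W_student = W_teacher + sigma * rng.normal(size=W_teacher.shape)
    return mean_cosine_similarity(hidden_activations(X, W_teacher),
                                  hidden_activations(X, W_student))

sim_shared = similarity_at_noise(0.0)   # identical initialization
sim_small = similarity_at_noise(0.01)   # lightly perturbed initialization
sim_large = similarity_at_noise(1.0)    # heavily perturbed initialization
print(sim_shared, sim_small, sim_large)
```

In an actual experiment the same similarity function would be applied to trained teacher and student networks and plotted against student accuracy.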

Subliminal learning is a phenomenon where behavioral traits of a teacher model (e.g., 'owl preference' or misalignment) transfer to a student model through seemingly semantically unrelated data (number sequences, code, CoT, etc.). Transfer occurs even when trait-related words/associations are heavily filtered from the data.

Misalignment transfer: When a student is trained only on "number sequences" generated by a misaligned teacher (created with insecure-code), the student outputs clear misalignment (violence/crime promotion) at significant rates in open-ended questions. This occurs even when negative-association numbers like "666, 911" are prohibited.

Transfer is strong when teacher and student share the same (or very close) initialization/base model, and nearly disappears when using different model families. Authors interpret this as transfer of model-specific statistical signals/patterns rather than data semantics.

When a teacher is updated by one gradient descent step on some loss L, a student with the same initialization that takes one step to mimic the teacher's outputs on "any data distribution" has its parameters pulled toward the teacher, moving in a direction that improves L. This general theorem explains 'why transfer occurs' in the experiments.
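A small numerical check can make the theorem's claim concrete. The sketch below is an assumption-laden toy, not the paper's setup: it uses a linear model with squared-error losses, one gradient step for the teacher on task data, and one gradient step for the student imitating the teacher on unrelated uniform-noise inputs. All sizes and learning rates are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 8, 3

def mse_loss(W, X, Y):
    """0.5 * mean squared error of the linear model x -> W x against targets Y."""
    return 0.5 * np.mean(np.sum((X @ W.T - Y) ** 2, axis=1))

def mse_grad(W, X, Y):
    """Gradient of mse_loss with respect to W."""
    return (X @ W.T - Y).T @ X / len(X)

# Shared initialization and a synthetic task (hypothetical data).
W0 = 0.1 * rng.normal(size=(d_out, d_in))
X_task = rng.normal(size=(64, d_in))
Y_task = X_task @ rng.normal(size=(d_out, d_in)).T

lr_teacher, lr_student = 0.1, 0.1

# Teacher: one gradient step on the task loss.
W_teacher = W0 - lr_teacher * mse_grad(W0, X_task, Y_task)

# Student: one gradient step imitating teacher outputs on unrelated inputs.
X_aux = rng.uniform(-1, 1, size=(256, d_in))
W_student = W0 - lr_student * mse_grad(W0, X_aux, X_aux @ W_teacher.T)

# Inner product of the two parameter updates, and the student's task loss.
alignment = float(np.sum((W_teacher - W0) * (W_student - W0)))
loss_before = mse_loss(W0, X_task, Y_task)
loss_after = mse_loss(W_student, X_task, Y_task)
print(alignment, loss_before, loss_after)
```

In this linear case the student update equals the teacher update multiplied by the positive semi-definite covariance of the auxiliary inputs, which is why the inner product cannot be negative.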

For Knowledge Distillation or training with Pretraining Synthetic Data Generation, this suggests that filtering alone may not prevent the propagation of unwanted traits (especially misalignment).

https://arxiv.org/pdf/2507.14805

Token Entanglement
Superposition Hypothesis
Repeated Token Phenomenon clustering attack

This may be explained by interference arising from superposition, which is why shared initialization matters.

Token entanglement is proposed as a candidate mechanism for subliminal learning (the phenomenon where a teacher's "hidden preferences/tendencies" transfer to a student even through seemingly unrelated data like number sequences).

Subliminal prompting: Even without fine-tuning, simply inserting one entangled number token into the prompt ("You love 087") can significantly bias the model's downstream behavior toward a specific concept ("owl"), as demonstrated experimentally.

Three methods to find entangled tokens:

Unembedding similarity: Measure whether number token t is close to concept token c using unembedding vector cosine similarity cos(U_t, U_c)

Output distribution (logit-based): Score entanglement by how much p(t) increases when prompts like "you love c" are given

Data frequency ratio: Measure whether number tokens appear more frequently under specific tendencies in Cloud et al.'s subliminal learning data
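The first two methods above can be sketched on toy arrays. This is an illustrative assumption, not the authors' code: the unembedding matrix U, the token ids, and the logit vectors are synthetic stand-ins, and the helper names are hypothetical. With a real model, U would be the unembedding matrix and the logits would come from prompts with and without the "you love c" text.

```python
import numpy as np

def unembedding_entanglement(U, concept_id, number_ids):
    """cos(U_t, U_c) for each candidate number token t."""
    U_c = U[concept_id]
    U_t = U[number_ids]
    return (U_t @ U_c) / (np.linalg.norm(U_t, axis=1) * np.linalg.norm(U_c))

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def logit_entanglement(logits_with_concept, logits_baseline, number_ids):
    """Ratio p(t | concept prompt) / p(t | baseline prompt) per number token."""
    p_concept = softmax(logits_with_concept)[number_ids]
    p_base = softmax(logits_baseline)[number_ids]
    return p_concept / p_base

# Toy vocabulary: token 0 plays the concept, tokens 10..19 play number tokens.
rng = np.random.default_rng(0)
vocab, dim = 50, 16
U = rng.normal(size=(vocab, dim))
concept_id = 0
number_ids = np.arange(10, 20)

# Plant one entangled token by placing it near the concept's unembedding vector.
U[13] = U[concept_id] + 0.1 * rng.normal(size=dim)

scores = unembedding_entanglement(U, concept_id, number_ids)
most_entangled = int(number_ids[np.argmax(scores)])
print(most_entangled)
```

The third method (data frequency ratio) would instead count how often each number token appears in data generated under a given trait versus a control.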

In Llama-3.1-8B-Instruct, Qwen2.5-7B-Instruct, and gemma-2-9b-it, entangled number tokens are commonly found, and inserting those numbers into system prompts can increase preference for specific animals (e.g., "sea turtle") by hundreds to thousands of times. For misalignment induction, certain numbers worsen TruthfulQA accuracy and open-ended alignment metrics, though the effect is much weaker than that of "explicit malicious prompts." Even so, numbers that perform statistically worse than random numbers were found in about half the conditions.

It's Owl in the Numbers: Token Entanglement in Subliminal Learning — LessWrong

By Amir Zur (Stanford), Alex Loftus (Northeastern), Hadas Orgad (Technion), Zhuofan (Josh) Ying (Columbia/CBAI), Kerem Sahin (Northeastern), and Davi…

https://www.lesswrong.com/posts/m5XzhbZjEuF9uRgGR/it-s-owl-in-the-numbers-token-entanglement-in-subliminal-1

When and How hidden biases transfer

Hidden biases in the teacher model (e.g., owl preference) transfer to student models trained on seemingly unrelated data (like number sequences). Divergence tokens (rare branching points where teachers with different biases would produce different tokens) are identified as the key cause of bias transfer.
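One simple way to locate such branching points is sketched below. It rests on an assumption that is not taken from the paper: a position counts as a divergence token when the greedy (argmax) next-token choices of two teachers differ on the same prefix. The logit arrays are synthetic and the function name is hypothetical.

```python
import numpy as np

def divergence_positions(logits_a, logits_b):
    """Positions where two teachers' greedy next-token choices differ.

    Both inputs have shape (sequence_length, vocab_size) and are assumed to be
    computed on the same token prefix at every position.
    """
    greedy_a = np.argmax(logits_a, axis=-1)
    greedy_b = np.argmax(logits_b, axis=-1)
    return np.flatnonzero(greedy_a != greedy_b)

rng = np.random.default_rng(0)
seq_len, vocab = 12, 20
logits_biased = rng.normal(size=(seq_len, vocab))   # stand-in for a biased teacher
logits_neutral = logits_biased.copy()               # stand-in for an unbiased teacher

# Make the two teachers disagree at exactly one position.
top = int(np.argmax(logits_biased[7]))
logits_neutral[7, (top + 1) % vocab] = logits_biased[7, top] + 1.0

positions = divergence_positions(logits_biased, logits_neutral)
print(positions)
```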

https://arxiv.org/pdf/2509.23886

Aligning after KD (KD → Align) is structurally disadvantageous because the distilled model has limited recall (the low-recall trap). The recommended order is therefore Align → KD: first align the high-recall large model, then compress/distill the result. In other words, the success of alignment depends on the recall of the reference model, so in practice it is essential to align first and then distill.

https://arxiv.org/pdf/2509.23667v1

Define a logit matrix indexed by pairs of a history and a future completion, and construct it from mean-centered logits; this is the extended logit matrix.

Empirically, the singular values decay according to a power law, supporting the idea that a fixed-rank approximation can capture much of the model’s behavior.
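A power-law decay of singular values can be checked with a log-log fit, as in the sketch below. The matrix here is synthetic with a planted exponent, the row-mean centering is an assumed reading of "mean-centered logits", and the function names are hypothetical; none of this reproduces the paper's measurements.

```python
import numpy as np

def centered_singular_values(M):
    """Singular values after subtracting each row's mean (an assumed centering)."""
    return np.linalg.svd(M - M.mean(axis=1, keepdims=True), compute_uv=False)

def power_law_exponent(singular_values, k_max=None):
    """Fit sigma_k ~ k^(-alpha) by least squares in log-log space; returns alpha."""
    s = np.asarray(singular_values, dtype=float)[:k_max]
    k = np.arange(1, len(s) + 1)
    slope, _ = np.polyfit(np.log(k), np.log(s), 1)
    return float(-slope)

# Synthetic matrix whose singular values follow k^(-1.5) exactly.
rng = np.random.default_rng(0)
n, alpha_true = 40, 1.5
Q1, _ = np.linalg.qr(rng.normal(size=(n, n)))
Q2, _ = np.linalg.qr(rng.normal(size=(n, n)))
spectrum = np.arange(1, n + 1, dtype=float) ** -alpha_true
M = Q1 @ np.diag(spectrum) @ Q2.T

alpha_raw = power_law_exponent(np.linalg.svd(M, compute_uv=False))
# Centering removes one direction, so only the leading values are fitted.
alpha_centered = power_law_exponent(centered_singular_values(M), k_max=20)
print(alpha_raw, alpha_centered)
```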

Using this low-dimensional structure, the paper proposes Lingen (Linear combination generation): it rewrites a target prompt as a linear combination of other “nonsense” prompts, then queries the model only on those nonsense prompts to generate a consistent response for the target. It further proves that this low-rank property is mathematically equivalent to an Input Switched Affine Network (ISAN), completing the theoretical foundation.
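The linear-algebra idea behind this can be illustrated on a synthetic matrix. The sketch below is not the paper's Lingen procedure: it assumes an exactly low-rank matrix of logits (rows are histories, columns are futures), fits the target row as a linear combination of other rows using a subset of columns by least squares, and then predicts the target's remaining columns from those other rows alone. All sizes and index choices are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hist, n_future, rank = 30, 50, 5

# Synthetic, exactly rank-5 "logit matrix": rows = histories, columns = futures.
M = rng.normal(size=(n_hist, rank)) @ rng.normal(size=(rank, n_future))

target = 0                           # the history we want to reconstruct
basis = np.arange(1, 11)             # ten other histories ("nonsense prompts" in spirit)
observed = np.arange(0, 20)          # futures used to fit the combination
held_out = np.arange(20, n_future)   # futures predicted without querying the target

# Solve M[basis, observed]^T c ~= M[target, observed] for the coefficients c.
coeffs, *_ = np.linalg.lstsq(M[np.ix_(basis, observed)].T,
                             M[target, observed],
                             rcond=1e-8)

# Predict the target's held-out logits from the basis rows only.
predicted = coeffs @ M[np.ix_(basis, held_out)]
max_error = float(np.max(np.abs(predicted - M[target, held_out])))
print(max_error)
```

With a real model the matrix is only approximately low-rank, so the reconstruction would be approximate rather than exact.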

More broadly, the extended logit matrix is approximately low-rank, suggesting that model information is distributed/encoded in a shared low-dimensional subspace. The fact that Lingen can reconstruct target generation even from a nonsense history implies that even semantically unrelated prompts carry the model’s low-dimensional signature in their logits. Subliminal learning can be seen as a natural consequence: even when the teacher outputs irrelevant tokens, trait information is imprinted along these low-rank logit directions, and a student distilled via KL inevitably learns this subspace.

While subliminal learning (as a phenomenon) emphasizes weight similarity and trait transfer between identical base models, this work is architecture-agnostic and theorizes the rank structure of logit sequences and the ISAN equivalence (Theorem 4.3). Both, however, point to a shared mechanism: “model identity can be recovered from logits under irrelevant context.” From a safety perspective, both suggest that prompt-level filtering alone is insufficient.

Sequences of Logits Reveal the Low Rank Structure of Language Models

A major problem in the study of large language models is to understand their inherent low-dimensional structure. We introduce an approach to study the low-dimensional structure of language models...

https://arxiv.org/abs/2510.24966
