Subliminal learning

Creator: Seonglae Cho
Created: 2025 Aug 19 21:07
Edited: 2025 Sep 30 23:34
Hidden traits in a teacher model (e.g., a preference for owls) transfer to student models even when the student is fine-tuned only on teacher-generated data that is semantically unrelated to the trait, such as bare number sequences. Divergence tokens, rare branching positions where the fine-tuned teacher's next-token choice departs from the base model's, are identified as the key cause of this bias transfer.
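As a rough illustration of the divergence-token idea, the sketch below (a minimal sketch, not code from the paper) compares the greedy next-token predictions of a teacher checkpoint against its base model over the same text and reports the positions where they disagree. Both model names are placeholders; here they point at the same checkpoint, so the list is empty, but with a genuinely trait-fine-tuned teacher the reported positions would be the divergence tokens.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "gpt2"     # placeholder for the base/reference model
TEACHER_ID = "gpt2"  # placeholder: would be the bias-fine-tuned teacher

tok = AutoTokenizer.from_pretrained(BASE_ID)
base = AutoModelForCausalLM.from_pretrained(BASE_ID).eval()
teacher = AutoModelForCausalLM.from_pretrained(TEACHER_ID).eval()

@torch.no_grad()
def divergence_positions(text: str) -> list[int]:
    """Positions where teacher and base disagree on the greedy next token."""
    ids = tok(text, return_tensors="pt").input_ids
    base_pred = base(ids).logits.argmax(dim=-1)        # (1, seq_len)
    teacher_pred = teacher(ids).logits.argmax(dim=-1)  # (1, seq_len)
    return (base_pred != teacher_pred).nonzero()[:, 1].tolist()

print(divergence_positions("Continue the sequence: 3, 7, 12, 18,"))
```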
Aligning after KD is structurally disadvantageous: the distilled model's limited recall caps what alignment can recover (the low-recall trap). Therefore, the order should be Align → KD: first align the high-recall large model, then compress/distill the aligned result. In other words, the success of alignment depends on the recall of the reference model, so in practice it is essential to align first and distill afterwards.
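Below is a minimal sketch of the distillation step in the Align → KD order, assuming the teacher has already been aligned: the student is trained against the aligned teacher's output distribution with a standard temperature-scaled KL loss (Hinton-style distillation; the names and shapes are illustrative, not a specific library API). Running this step before alignment would instead bake the unaligned teacher's behavior into a low-recall student that later alignment cannot fully fix, which is exactly the trap described above.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor,
            teacher_logits: torch.Tensor,
            temperature: float = 2.0) -> torch.Tensor:
    """Temperature-scaled KL(teacher || student), averaged over the batch."""
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    # The t**2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * t**2

# Dummy usage: (batch, vocab) logits from an aligned teacher and a small student.
student_logits = torch.randn(4, 32000, requires_grad=True)
teacher_logits = torch.randn(4, 32000)
loss = kd_loss(student_logits, teacher_logits)
loss.backward()
```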