Subliminal learning

Creator: Seonglae Cho
Created: 2025 Aug 19 21:07
Edited: 2025 Sep 30 23:34
Hidden traits in a teacher model (e.g., a preference for owls) transfer to student models even when the student is fine-tuned only on teacher-generated data that is semantically unrelated to the trait, such as bare number sequences. Divergence tokens, rare branching positions where the fine-tuned teacher's next-token choice departs from the base model's, are identified as the key cause of this bias transfer.
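As a rough illustration of the divergence-token idea, the sketch below (a minimal sketch, not code from the paper) compares the greedy next-token predictions of a teacher checkpoint against its base model over the same text and reports the positions where they disagree. Both model names are placeholders; here they point at the same checkpoint, so the list is empty, but with a genuinely trait-fine-tuned teacher the reported positions would be the divergence tokens.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "gpt2"     # placeholder for the base/reference model
TEACHER_ID = "gpt2"  # placeholder: would be the bias-fine-tuned teacher

tok = AutoTokenizer.from_pretrained(BASE_ID)
base = AutoModelForCausalLM.from_pretrained(BASE_ID).eval()
teacher = AutoModelForCausalLM.from_pretrained(TEACHER_ID).eval()

@torch.no_grad()
def divergence_positions(text: str) -> list[int]:
    """Positions where teacher and base disagree on the greedy next token."""
    ids = tok(text, return_tensors="pt").input_ids
    base_pred = base(ids).logits.argmax(dim=-1)        # (1, seq_len)
    teacher_pred = teacher(ids).logits.argmax(dim=-1)  # (1, seq_len)
    return (base_pred != teacher_pred).nonzero()[:, 1].tolist()

print(divergence_positions("Continue the sequence: 3, 7, 12, 18,"))
```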
Aligning after KD is structurally disadvantageous: the distilled model's limited recall caps what alignment can recover (the low-recall trap). Therefore, the order should be Align → KD: first align the high-recall large model, then compress/distill the aligned result. In other words, the success of alignment depends on the recall of the reference model, so in practice it is essential to align first and distill afterwards.
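Below is a minimal sketch of the distillation step in the Align → KD order, assuming the teacher has already been aligned: the student is trained against the aligned teacher's output distribution with a standard temperature-scaled KL loss (Hinton-style distillation; the names and shapes are illustrative, not a specific library API). Running this step before alignment would instead bake the unaligned teacher's behavior into a low-recall student that later alignment cannot fully fix, which is exactly the trap described above.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor,
            teacher_logits: torch.Tensor,
            temperature: float = 2.0) -> torch.Tensor:
    """Temperature-scaled KL(teacher || student), averaged over the batch."""
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    # The t**2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * t**2

# Dummy usage: (batch, vocab) logits from an aligned teacher and a small student.
student_logits = torch.randn(4, 32000, requires_grad=True)
teacher_logits = torch.randn(4, 32000)
loss = kd_loss(student_logits, teacher_logits)
loss.backward()
```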