Hidden biases in a teacher model (e.g., a preference for owls) transfer directly to student models even when the distillation data is completely unrelated (such as number sequences). Divergence tokens — rare branching points where the teacher's output token diverges from the base model's — are the key carrier of this bias transfer.
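A minimal sketch of how one might flag divergence tokens, assuming Hugging Face `transformers` access to both the teacher and a base reference model. The model names and the KL threshold here are illustrative assumptions, not the original experimental setup:

```python
# Sketch: flag positions where a fine-tuned teacher's next-token prediction
# departs from the base model's. Models and threshold are stand-ins.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "gpt2"     # assumed stand-in for the base reference model
TEACHER = "gpt2"  # assumed stand-in for the biased teacher (same architecture)

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE).eval()
teacher = AutoModelForCausalLM.from_pretrained(TEACHER).eval()

@torch.no_grad()
def divergence_tokens(text: str, kl_threshold: float = 0.5):
    """Return positions where teacher and base disagree on the next token."""
    ids = tok(text, return_tensors="pt").input_ids
    p = torch.log_softmax(teacher(ids).logits, dim=-1)  # teacher log-probs
    q = torch.log_softmax(base(ids).logits, dim=-1)     # base log-probs
    kl = (p.exp() * (p - q)).sum(-1).squeeze(0)  # per-position KL(teacher||base)
    flagged = []
    for t in range(ids.size(1) - 1):
        # Branching point: top-1 tokens differ, or the distributions drift apart
        if p[0, t].argmax() != q[0, t].argmax() or kl[t] > kl_threshold:
            flagged.append((t, tok.decode(ids[0, t + 1])))
    return flagged

print(divergence_tokens("The numbers are 4, 7, 2, 9,"))
```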
Aligning after KD (KD → Align) is structurally disadvantageous because the distilled student's recall is already limited (the low-recall trap). Therefore the order should be Align → KD: first align the high-recall large model, then compress/distill the result. In other words, the success of alignment depends on the recall of the reference model, so in practice it is essential to align first and then distill.
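A schematic of the Align → KD ordering, with the alignment step stubbed out. The function names and the temperature-scaled forward-KL distillation loss are assumptions for illustration, not a specific library API:

```python
# Schematic of Align -> KD: align the large model first, then distill.
import torch
import torch.nn.functional as F

def align(model, preference_data):
    """Placeholder: run RLHF/DPO on the *large, high-recall* model first."""
    ...  # alignment happens while recall is still high
    return model

def distill_step(student, teacher, batch, T: float = 2.0):
    """One KD step: student matches the already-aligned teacher's distribution."""
    with torch.no_grad():
        t_logits = teacher(batch).logits / T  # soft targets from aligned teacher
    s_logits = student(batch).logits / T
    loss = F.kl_div(
        F.log_softmax(s_logits, dim=-1),
        F.softmax(t_logits, dim=-1),
        reduction="batchmean",
    ) * T * T  # standard temperature scaling of the KD gradient
    return loss

# Order matters: align first (high recall), then compress.
# aligned_teacher = align(big_model, preference_data)
# loss = distill_step(small_model, aligned_teacher, batch)
```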

Seonglae Cho