On-policy Distillation

Creator
Seonglae Cho
Created
2026 May 21 10:4
Edited
2026 May 21 10:40

OPD

OPD
thunlp (updated 2026 May 28 13:16)

Built on GRPO, the work proposes two practical strategies to “rescue” OPD when it fails. (1) Off-policy cold start: before starting OPD, first run SFT on the student using the teacher’s rollout data to increase the initial overlap ratio and close the gap in thinking patterns. (2) Teacher-aligned prompt selection: use prompts that are similar to the teacher model’s post-training data to maximize compatibility between teacher and student.
It identifies two key conditions for OPD success: thinking-pattern consistency and knowledge novelty. At the token level, it analyzes how the student and teacher distributions gradually align along trajectories visited by the student, focusing on the overlap tokens shared by the two models’ top-k sets. Quantitatively, the probability mass on these overlap tokens (the overlap mass) accounts for 97% to 99% of the total, showing that most of the OPD gradient signal arises from this region.
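The overlap statistics described above can be illustrated with a minimal sketch. The definitions here are assumptions, since the note does not preserve the paper's formulas: the overlap ratio is taken as the size of the intersection of the two top-k sets divided by k, and the overlap mass is reported separately under the student and under the teacher. The function name `topk_overlap_stats` and the toy logits are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def topk_overlap_stats(student_logits, teacher_logits, k):
    """Overlap statistics at a single token position.

    Assumed definitions (not taken from the paper):
      overlap_ratio = |top-k(student) ∩ top-k(teacher)| / k
      *_overlap_mass = probability each model assigns to the overlap tokens
    """
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    top_s = set(np.argsort(-p_s)[:k].tolist())  # student's k most likely tokens
    top_t = set(np.argsort(-p_t)[:k].tolist())  # teacher's k most likely tokens
    idx = np.array(sorted(top_s & top_t), dtype=np.intp)
    return {
        "overlap_ratio": len(idx) / k,
        "student_overlap_mass": float(p_s[idx].sum()),
        "teacher_overlap_mass": float(p_t[idx].sum()),
    }

if __name__ == "__main__":
    # Toy vocabulary of 6 tokens; the two models agree on tokens 0 and 2.
    student = [4.0, 3.0, 2.0, 0.0, -1.0, -2.0]
    teacher = [3.5, 0.0, 2.5, 3.0, -1.0, -2.0]
    print(topk_overlap_stats(student, teacher, k=3))
```

In practice these statistics would be averaged over every position of student-generated rollouts; the sketch handles one position to keep the definitions visible.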
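The OPD loss is not a GRPO loss; it is reverse-KL distillation on student-generated trajectories. The paper’s standard OPD objective is not a GRPO surrogate but a sequence-level reverse KL, i.e., the divergence KL(student ‖ teacher) evaluated over trajectories sampled from the student. With autoregressive factorization it becomes an expectation, over student-sampled prefixes, of a sum of per-token reverse-KL terms between the two next-token distributions, giving an “exact token-level decomposition.” Implementation is divided into three supervision granularities: sampled-token OPD, which evaluates the student-to-teacher log-probability ratio only at the sampled token and is an unbiased single-sample estimator because its expectation under the student’s next-token distribution equals the per-token reverse KL; full-vocabulary OPD, which computes the per-token reverse KL over the entire vocabulary; and top-k OPD, which minimizes the reverse KL between the two distributions renormalized on the student’s top-k set.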
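The three supervision granularities can be compared numerically at a single position. The sketch below is written in assumed notation, since the original formulas are not preserved in this note: with student next-token distribution p_S and teacher distribution p_T, the full-vocabulary term is KL(p_S ‖ p_T) = Σ_v p_S(v) [log p_S(v) − log p_T(v)], the sampled-token term is log p_S(y) − log p_T(y) for one token y drawn from p_S, and the top-k term is the same KL after restricting both distributions to the student's top-k tokens and renormalizing. All function names are hypothetical, and the code only evaluates loss values with NumPy; an actual training loop would need an autograd framework and batched tensors.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D logit vector."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def full_vocab_reverse_kl(student_logits, teacher_logits):
    """KL(p_S || p_T) summed over the whole vocabulary."""
    ls, lt = log_softmax(student_logits), log_softmax(teacher_logits)
    return float(np.sum(np.exp(ls) * (ls - lt)))

def sampled_token_estimate(student_logits, teacher_logits, token):
    """Single-sample estimate: log-ratio at one token (assumed drawn from p_S)."""
    ls, lt = log_softmax(student_logits), log_softmax(teacher_logits)
    return float(ls[token] - lt[token])

def topk_reverse_kl(student_logits, teacher_logits, k):
    """Reverse KL between distributions renormalized on the student's top-k set."""
    s = np.asarray(student_logits, dtype=float)
    t = np.asarray(teacher_logits, dtype=float)
    idx = np.argsort(-s)[:k]      # student's top-k token ids
    ls_k = log_softmax(s[idx])    # softmax over the subset = renormalization
    lt_k = log_softmax(t[idx])
    return float(np.sum(np.exp(ls_k) * (ls_k - lt_k)))

def monte_carlo_sampled_mean(student_logits, teacher_logits, n, seed=0):
    """Average the sampled-token estimate over n tokens drawn from p_S."""
    rng = np.random.default_rng(seed)
    ls, lt = log_softmax(student_logits), log_softmax(teacher_logits)
    tokens = rng.choice(len(ls), size=n, p=np.exp(ls))
    return float(np.mean(ls[tokens] - lt[tokens]))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    student, teacher = rng.normal(size=8), rng.normal(size=8)
    print("full-vocabulary KL :", full_vocab_reverse_kl(student, teacher))
    print("sampled-token mean :", monte_carlo_sampled_mean(student, teacher, 200_000))
    print("top-4 KL           :", topk_reverse_kl(student, teacher, 4))
```

The sampled-token mean should approach the full-vocabulary value as the sample count grows, which is the unbiasedness property stated above. The top-k value generally differs from the full-vocabulary value, because renormalization discards the mass outside the student's top-k set.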
arxiv.org
