OPD
thunlp • Updated 2026 May 28 13:16
Built on GRPO, the work proposes two practical strategies to “rescue” OPD when it fails. (1) Off-policy cold start: before starting OPD, first run SFT on the student using the teacher’s rollout data to increase the initial overlap ratio and close the gap in thinking patterns. (2) Teacher-aligned prompt selection: use prompts that are similar to the teacher model’s post-training data to maximize compatibility between teacher and student.
The work identifies two key conditions for OPD success: thinking-pattern consistency and knowledge novelty. It analyzes how, at the token level, the student and teacher distributions gradually align along trajectories visited by the student, focusing on the overlap tokens shared by the two models’ top-k sets. Quantitatively, the probability mass on these overlap tokens (the overlap mass) accounts for 97% to 99% of the total, showing that most of the OPD gradient signal arises from this region.
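The overlap quantities described above can be illustrated with a small sketch. It is not the paper's code: the function names, the toy logits, and the choice to report the overlap mass separately under the student and under the teacher are all assumptions, since the note does not preserve the exact definition.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def top_k_ids(probs, k):
    """Return the set of indices of the k most probable tokens."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(ranked[:k])


def overlap_stats(student_probs, teacher_probs, k):
    """Hypothetical helper: overlap ratio and overlap mass at one position.

    Assumption: overlap tokens are the intersection of the two top-k sets,
    and overlap mass is the probability a model assigns to that intersection.
    """
    overlap = top_k_ids(student_probs, k) & top_k_ids(teacher_probs, k)
    return {
        "overlap_ratio": len(overlap) / k,
        "student_overlap_mass": sum(student_probs[i] for i in overlap),
        "teacher_overlap_mass": sum(teacher_probs[i] for i in overlap),
    }


# Toy next-token logits over a 6-token vocabulary (made-up numbers).
student = softmax([4.0, 3.0, 1.0, 0.0, -1.0, -2.0])
teacher = softmax([3.5, -1.0, 3.0, 2.0, 0.0, -2.0])
stats = overlap_stats(student, teacher, k=3)
print(stats)
```

In a real run these statistics would be averaged over positions along student rollouts. A low initial overlap ratio is the situation the off-policy cold start is meant to address.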
The OPD loss is not a GRPO loss: the paper’s standard OPD objective is a sequence-level reverse KL from the student to the teacher, evaluated on student-generated trajectories, rather than a GRPO surrogate. Under autoregressive factorization it becomes a sum of per-token reverse KL terms over prefixes sampled from the student, giving an “exact token-level decomposition.” Implementation is divided into three supervision granularities: sampled-token OPD, which scores only the token the student actually sampled and is an unbiased single-sample estimator, since the expected student-to-teacher log-ratio under the student distribution equals the per-token reverse KL; full-vocabulary OPD, which computes the per-token reverse KL over the entire vocabulary; and top-k OPD, which minimizes the reverse KL between the two distributions renormalized on the student’s top-k set.
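The three supervision granularities can be compared at a single token position with a minimal sketch. The function names and toy distributions are hypothetical, and the code only evaluates the scalar quantities. It does not show how the paper turns them into gradients.

```python
import math
import random


def full_vocab_reverse_kl(p_student, p_teacher):
    """Per-token reverse KL, KL(student || teacher), over the whole vocabulary."""
    return sum(
        ps * math.log(ps / pt)
        for ps, pt in zip(p_student, p_teacher)
        if ps > 0.0
    )


def sampled_token_estimate(p_student, p_teacher, rng):
    """Single-sample estimate: log-ratio at one token drawn from the student."""
    token = rng.choices(range(len(p_student)), weights=p_student, k=1)[0]
    return math.log(p_student[token] / p_teacher[token])


def top_k_reverse_kl(p_student, p_teacher, k):
    """Reverse KL between both distributions renormalized on the student's top-k set."""
    top = sorted(range(len(p_student)), key=lambda i: p_student[i], reverse=True)[:k]
    z_s = sum(p_student[i] for i in top)
    z_t = sum(p_teacher[i] for i in top)
    return sum(
        (p_student[i] / z_s) * math.log((p_student[i] / z_s) / (p_teacher[i] / z_t))
        for i in top
    )


# Toy distributions over a 4-token vocabulary (assumed values, teacher strictly positive).
p_student = [0.6, 0.25, 0.1, 0.05]
p_teacher = [0.5, 0.2, 0.2, 0.1]

exact = full_vocab_reverse_kl(p_student, p_teacher)
rng = random.Random(0)
n_samples = 20000
mc_mean = sum(sampled_token_estimate(p_student, p_teacher, rng) for _ in range(n_samples)) / n_samples
topk = top_k_reverse_kl(p_student, p_teacher, k=2)

print(f"full-vocabulary KL: {exact:.4f}")
print(f"sampled-token mean: {mc_mean:.4f}")
print(f"top-2 KL:           {topk:.4f}")
```

Averaged over many draws, the sampled-token estimate approaches the full-vocabulary value, which is the sense in which it is unbiased. The top-k variant measures a different quantity, because renormalization discards the mass outside the student's top-k set.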
arxiv.org
https://arxiv.org/pdf/2604.13016

Seonglae Cho