Simple Self-Distillation (SSD)
samples solutions from the base model at a chosen training temperature and truncation setting, then performs standard SFT on the resulting raw outputs; the training objective is the usual next-token cross-entropy on the sampled sequences. In the theoretical analysis of SSD, the loss decomposes into three terms: the first removes distractor tail mass via support compression; the second reshapes the distribution within the retained support; and the third preserves alignment with the base model. The training and decoding temperatures compose into a single effective temperature.
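The first two terms of the decomposition correspond directly to the two knobs in the sampling step. A toy sketch (not the paper's code) of how temperature and nucleus truncation act on one categorical next-token distribution, assuming top-p truncation as the truncation setting:

```python
import numpy as np

def truncate_and_temper(logits, top_p=0.9, temperature=0.7):
    """Apply temperature then nucleus (top-p) truncation to a categorical
    distribution, mirroring SSD's two effects: in-support reshaping
    (temperature) and support compression (truncation of the tail)."""
    # Temperature reshapes the distribution within its support.
    z = logits / temperature
    p = np.exp(z - z.max())
    p /= p.sum()
    # Nucleus truncation compresses the support: keep the smallest set of
    # tokens whose cumulative mass reaches top_p, zero out the rest.
    order = np.argsort(p)[::-1]
    csum = np.cumsum(p[order])
    keep = order[: np.searchsorted(csum, top_p) + 1]
    q = np.zeros_like(p)
    q[keep] = p[keep]
    return q / q.sum()

logits = np.array([2.0, 1.5, 0.5, -1.0, -3.0])
q = truncate_and_temper(logits, top_p=0.9, temperature=0.7)
print(np.round(q, 3))  # distractor tail tokens receive exactly zero mass
```

SFT on samples drawn this way then pulls the model toward this compressed, reshaped distribution rather than its raw output distribution.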
On Qwen3-30B-Instruct, LiveCodeBench v6 pass@1 improved from 42.4% to 55.3% (+12.9pp, +30.4% relative). On hard problems, pass@1 increased by +15.3pp, and pass@5 rose from 31.1% to 54.1% (+23.0pp). Improvements held across all five evaluated models: Qwen3-30B-Instruct (+12.9pp), Qwen3-4B-Instruct (+7.5pp), Llama-3.1-8B-Instruct (+3.5pp), Qwen3-4B-Thinking (+3.3pp), and Qwen3-30B-Thinking (+2.1pp). Even against the base model's best result over a temperature sweep, SSD maintained a +11.8pp advantage, showing the gains cannot be reproduced by decode-time tuning alone.
https://arxiv.org/pdf/2604.01193
Self-Distilled RLVR
This work theoretically shows that the root cause of on-policy distillation (OPSD) failure is that distribution matching under information asymmetry is an ill-posed problem. Theorem 1 decomposes the OPSD objective into an optimizable divergence term plus the conditional mutual information between the teacher's token prediction and the privileged information $r$. This second term is independent of the student parameters $\theta$ and forms an irreducible lower bound, so the student cannot eliminate the gap through optimization; the per-sample gradient contains an $r$-specific deviation that accumulates path-dependently and ends up encoding the spurious correlation into the model parameters.
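The irreducibility can be checked numerically in a toy setting: for a teacher whose next-token distribution depends on privileged information $r$ that the student never sees, the best $r$-blind student is the marginal over $r$, and its expected KL loss equals exactly the mutual information $I(Y;R)$, a floor no choice of student distribution can beat. A minimal sketch (illustrative numbers, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy teacher: its token distribution depends on privileged info r
# (2 equally likely values), over a vocabulary of 4 tokens.
P_r = np.array([0.5, 0.5])
p_y_given_r = np.array([[0.7, 0.2, 0.05, 0.05],
                        [0.1, 0.1, 0.4,  0.4 ]])

def expected_kl(q):
    """E_r[ KL(p(.|r) || q) ]: the r-blind student's matching loss."""
    return sum(P_r[r] * np.sum(p_y_given_r[r] *
               np.log(p_y_given_r[r] / q)) for r in range(2))

# The optimal r-blind student is the marginal m(y) = sum_r P(r) p(y|r);
# its loss equals I(Y;R), a student-independent floor.
m = P_r @ p_y_given_r
mi = expected_kl(m)

# No other student distribution does better (loss = I(Y;R) + KL(m||q)).
for _ in range(1000):
    q = rng.dirichlet(np.ones(4))
    assert expected_kl(q) >= mi - 1e-12

print(f"irreducible gap I(Y;R) = {mi:.4f} nats")
```

The identity behind the loop is $\mathbb{E}_r[\mathrm{KL}(p_r \,\|\, q)] = I(Y;R) + \mathrm{KL}(m \,\|\, q)$, which is minimized at $q = m$ with value $I(Y;R)$.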
RLSD defines a per-token privileged information gain from the teacher's evidence over the student, and converts it into token-wise weights via direction-aware evidence reweighting. The final token-level advantage couples this weight with the environment reward: the reward determines the direction of the update, while the teacher's evidence ratio determines only the magnitude. This design can be interpreted as a Bayesian belief update and integrates structurally with GRPO's importance-ratio clipping.
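The exact formulas are not reproduced here, but the stated design can be sketched as follows, assuming the evidence gap is the teacher-minus-student token log-probability and the clipping bounds are illustrative:

```python
import numpy as np

def rlsd_token_advantages(reward, logp_teacher, logp_student, clip=2.0):
    """Hedged sketch of direction-aware evidence reweighting (not the
    paper's exact formula): the environment reward fixes the *sign* of
    every token's update, while the teacher-vs-student evidence ratio
    sets only the *magnitude*, clipped GRPO-style."""
    gain = logp_teacher - logp_student                # privileged info gain
    weight = np.clip(np.exp(gain), 1.0 / clip, clip)  # clipped evidence ratio
    return np.sign(reward) * weight                   # direction from reward only

adv = rlsd_token_advantages(
    reward=-1.0,
    logp_teacher=np.array([-0.2, -1.0, -0.1]),
    logp_student=np.array([-1.5, -0.8, -0.1]),
)
# All tokens move in the reward's direction; magnitudes differ per token.
```

Because the teacher signal never flips the sign, a confident teacher cannot push the student toward an answer the environment rejected; it only modulates step size.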
RLSD prevents entropy collapse, maintaining higher entropy than GRPO, with a stable clip ratio of 3–6%. In experiments on five multimodal reasoning benchmarks using Qwen3-VL-8B-Instruct, RLSD achieves an average accuracy of 56.18%, which is +4.69% over the Base LLM (51.49%) and +2.32% over GRPO (53.86%). In particular, it improves MathVision by +3.91% (from 48.82% to 52.73%) and MathVista by +1.9% (from 76.20% to 78.10%). It outperforms OPSD (52.49%) and SDPO (52.74%) by +3.69% and +3.44%, respectively, and improves over GRPO+OPSD (52.91%) by +3.27%. By 200 training steps, it already surpasses GRPO's 400-step performance, yielding over 2× faster convergence.
https://arxiv.org/pdf/2604.03128

Seonglae Cho