DFT
From an RL perspective, analyzing the SFT gradient reveals a sparse, inverse-probability-weighted reward structure: the implicit reward is 1 only for the expert answer and 0 otherwise, and each token's gradient is effectively scaled by 1/p(token). This inverse-probability weighting makes the gradient variance explode for low-probability correct tokens. DFT's fix is to multiply each token's SFT loss by that token's (detached) probability → this cancels the inverse-probability weighting and makes the effective reward uniform.
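A minimal PyTorch sketch of this reweighting, assuming standard (batch, seq_len, vocab) logits and next-token labels; the function name and masking details here are illustrative, not the paper's reference code:

```python
import torch
import torch.nn.functional as F

def dft_loss(logits, labels, ignore_index=-100):
    """Per-token cross-entropy reweighted by the detached target-token probability.

    Standard SFT minimizes -log p(y_t); the DFT idea is to weight each token's
    loss by stop_grad(p(y_t)), cancelling the implicit 1/p factor in the SFT gradient.
    """
    vocab = logits.size(-1)
    flat_logits = logits.view(-1, vocab)
    flat_labels = labels.view(-1)

    # Standard per-token negative log-likelihood (no reduction yet)
    nll = F.cross_entropy(
        flat_logits, flat_labels, ignore_index=ignore_index, reduction="none"
    )

    # Probability of the target token, detached so it acts as a fixed weight
    with torch.no_grad():
        probs = F.softmax(flat_logits, dim=-1)
        safe_labels = flat_labels.clamp_min(0)  # avoid gather on ignore_index
        p_target = probs.gather(1, safe_labels.unsqueeze(1)).squeeze(1)

    mask = (flat_labels != ignore_index).float()
    # The one-line change relative to SFT: scale each token's loss by p_target
    return (p_target * nll * mask).sum() / mask.sum().clamp_min(1.0)
```

In plain SFT the last line would average `nll` alone; multiplying by `p_target` (kept out of the autograd graph) is the entire modification.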
The implementation requires only a one-line code change, yet it delivers significant improvements over SFT on mathematical reasoning benchmarks (AMC, AIME, Olympiad, etc.). It learns faster and converges sooner than SFT, and while SFT performance degrades on difficult datasets, DFT improves consistently. In RL settings, DFT outperforms not only offline RL methods like DPO and RFT but sometimes even online RL methods like PPO and GRPO.