Group Sequence Policy Optimization (GSPO)
An algorithm that derives its importance weights from the sequence-level likelihood rather than token-level probabilities, while still being able to assign token-wise advantages (a minimal sketch follows the list below).
- Superior training stability, efficiency, and performance compared to GRPO (gains on AIME'24, LiveCodeBench, and CodeForces); it clips a larger fraction of tokens yet still trains more efficiently.
- Converges and stabilizes without the Routing Replay that GRPO requires for MoE training: the sequence likelihood is far less sensitive to routing variations, fundamentally mitigating expert-activation fluctuation.
- Can be optimized using only sequence likelihoods from the inference engine, with no token-wise recomputation → a simpler RL pipeline.
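
A minimal PyTorch sketch of the sequence-level importance ratio and clipped objective described above. The function name, tensor shapes, the group-normalized advantage, and the clipping value are assumptions for illustration, not a reference implementation.

```python
import torch

def gspo_loss(logp_new, logp_old, rewards, response_mask, eps=3e-4):
    """Sequence-level clipped surrogate in the spirit of GSPO (sketch).

    logp_new:      (G, T) per-token log-probs under the current policy
    logp_old:      (G, T) per-token log-probs under the rollout policy (detached)
    rewards:       (G,)   scalar reward for each of the G responses in the group
    response_mask: (G, T) 1 for response tokens, 0 for padding
    """
    lengths = response_mask.sum(dim=-1)

    # Sequence-level importance ratio, length-normalized:
    #   s_i = (pi_new(y_i | x) / pi_old(y_i | x)) ** (1 / |y_i|)
    log_ratio = ((logp_new - logp_old) * response_mask).sum(dim=-1) / lengths
    seq_ratio = torch.exp(log_ratio)

    # Group-normalized advantage: one scalar per sequence, shared by all its tokens
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Clip once per sequence (eps value is illustrative; sequence ratios sit near 1,
    # so the range is much tighter than token-level PPO-style clipping)
    unclipped = seq_ratio * adv
    clipped = torch.clamp(seq_ratio, 1.0 - eps, 1.0 + eps) * adv
    return -torch.min(unclipped, clipped).mean()
```

Because the ratio is built from the summed (length-normalized) sequence log-probability and clipped once per sequence, only sequence likelihoods from the inference engine are needed, which is what the last bullet refers to.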