GSPO

Creator
Seonglae Cho
Created
2025 Aug 1 21:57
Edited
2025 Aug 1 22:01
Refs
L2T

Group Sequence Policy Optimization

A policy-optimization method that derives importance weights from sequence-level probability rather than per-token probabilities, so every token in a response shares the same sequence-level advantage.
  • Superior learning stability, efficiency, and performance compared to GRPO (improvements on AIME'24, LiveCodeBench, and CodeForces). Although it clips a larger fraction of tokens, it achieves better learning efficiency
  • Converges and stabilizes without the Routing Replay workaround GRPO requires: sequence probability is less sensitive to routing variations, fundamentally mitigating expert-activation fluctuation in MoE models
  • Can be optimized using only the sequence probabilities returned by the inference engine, without token-wise recomputation, simplifying the RL pipeline
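The sequence-level objective above can be sketched as follows. This is a minimal illustration, not the reference implementation: the length-normalized sequence importance ratio s_i = (π_θ(y_i|x) / π_old(y_i|x))^(1/|y_i|) is clipped PPO-style, and the function name `gspo_loss` and the clipping range `eps` are assumptions for the sketch (GSPO uses a much narrower clip range than token-level methods).

```python
import torch

def gspo_loss(logp_new, logp_old, advantages, lengths, eps=3e-4):
    """Sequence-level clipped policy loss (GSPO-style sketch).

    logp_new, logp_old: summed sequence log-probs under the current and
        old policy, shape (G,) for a group of G sampled responses.
    advantages: group-normalized rewards, shape (G,).
    lengths: response token counts |y_i|, shape (G,).
    """
    # Length-normalized sequence importance ratio:
    # s_i = exp((log pi_new - log pi_old) / |y_i|)
    s = torch.exp((logp_new - logp_old) / lengths)
    clipped = torch.clamp(s, 1.0 - eps, 1.0 + eps)
    # PPO-style clipped surrogate at the sequence level; negate to minimize.
    return -torch.mean(torch.min(s * advantages, clipped * advantages))

# Group-normalized advantages, as in GRPO: A_i = (r_i - mean(r)) / std(r)
rewards = torch.tensor([1.0, 0.0, 0.5, 0.0])
adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
```

Because the ratio depends only on whole-sequence log-probabilities, the values already produced by the inference engine suffice; no per-token recomputation is needed.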