Group Relative Policy Optimization
In conventional PPO, a separate critic network (a value-function estimator) is used to compute the advantage. GRPO, in contrast, drops the critic: it samples multiple responses to the same question and uses intra-group reward statistics to compute a relative advantage.
Group-based relative advantage update
Rather than relying on a learned value function, GRPO generates multiple outputs for the same input (prompt), groups them together, and measures each sample's relative performance (advantage) against the reward distribution within that group. Specifically, the reward model scores every output in the group, and each sample's advantage is its group-normalized reward, so no value model is involved.
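A minimal sketch of that normalization step, assuming per-sample scalar rewards arranged as (num_prompts, group_size); the function name, shapes, and group size are illustrative, not taken from any specific library:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Group-normalized advantages from reward-model scores.

    rewards: (num_prompts, group_size) -- scores for group_size sampled
    completions of each prompt. Each score is normalized by its own group's
    mean and standard deviation, so no value network is needed.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled answers each (illustrative group size).
rewards = torch.tensor([[0.1, 0.9, 0.4, 0.6],
                        [1.0, 1.0, 0.2, 0.8]])
advantages = group_relative_advantages(rewards)  # zero mean within each group
```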
This group-relative advantage is then used in the PPO-style clipped update in place of a GAE estimate, so the size and sign of each sample's update reflect how much better or worse that output is than the group average.
As a result, outputs that clearly outperform the group receive larger updates while outputs near the average receive smaller ones, which improves learning stability and efficiency.
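Continuing the sketch, the group-relative advantage simply takes the place of the critic-based estimate inside the usual clipped surrogate; the function names and the 0.2 clipping range below are illustrative assumptions:

```python
import torch

def clipped_policy_loss(logprobs: torch.Tensor,
                        old_logprobs: torch.Tensor,
                        advantages: torch.Tensor,
                        clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate driven by group-relative advantages.

    logprobs / old_logprobs: per-sample log-probabilities of each completion
    under the current and behavior policies; advantages: output of
    group_relative_advantages (all tensors share the same shape).
    """
    ratio = torch.exp(logprobs - old_logprobs)  # importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Pessimistic (elementwise minimum) objective, negated to form a loss.
    return -torch.min(unclipped, clipped).mean()
```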
Loss & KL penalty
A KL penalty keeps the policy from straying too far from the original (reference) model and over-optimizing against the reward model; GRPO adds this penalty directly to the loss rather than to the reward. Virtually all RLHF-style RL methods for language models include such a term to prevent reward hacking.
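A sketch of one such penalty, assuming per-sample (or per-token) log-probabilities from the current policy and a frozen reference model; the non-negative estimator and the beta weight are illustrative choices:

```python
import torch

def kl_penalty(logprobs: torch.Tensor,
               ref_logprobs: torch.Tensor,
               beta: float = 0.04) -> torch.Tensor:
    """Weighted KL penalty against a frozen reference policy.

    Uses the non-negative, low-variance estimator
    KL ~= exp(ref - cur) - (ref - cur) - 1, averaged and scaled by beta
    (the 0.04 here is only a placeholder coefficient).
    """
    log_ratio = ref_logprobs - logprobs
    kl = torch.exp(log_ratio) - log_ratio - 1.0
    return beta * kl.mean()
```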
Conclusion
In other words, GRPO is still PPO-like and on-policy, but it is more flexible and removes the dependency on a critic network: there is no need to learn a value function in order to estimate advantages.
- Removes the critic network
- Computes advantages from group-relative (normalized) rewards within the PPO-style clipped update
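To tie these pieces together, a toy end-to-end step that reuses the hypothetical helpers sketched above, with dummy tensors standing in for real rollouts:

```python
import torch

# Dummy data: 2 prompts, 4 sampled answers each (shapes are illustrative).
rewards      = torch.rand(2, 4)                        # reward-model scores
old_logprobs = torch.randn(2, 4)                       # behavior policy log-probs
logprobs     = (old_logprobs + 0.01 * torch.randn(2, 4)).requires_grad_(True)
ref_logprobs = old_logprobs.clone()                    # frozen reference policy

advantages = group_relative_advantages(rewards)        # no critic anywhere
loss = (clipped_policy_loss(logprobs, old_logprobs, advantages)
        + kl_penalty(logprobs, ref_logprobs))
loss.backward()                                        # gradients flow into logprobs only
```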