Group Relative Policy Optimization
Conventional PPO uses a separate critic network (a value function estimator) to calculate the advantage. GRPO drops the critic: it groups multiple responses to the same question and uses intra-group reward statistics to compute a relative advantage.
Multi policy-group based update
Instead of estimating the advantage with a learned value function, GRPO samples multiple outputs for the same input (prompt), treats them as one group, and scores each output relative to the reward distribution within that group. Specifically, each output's reward from the reward model is normalized by the group's statistics, so no value model is needed.
This relative advantage replaces GAE: the sign and size of each update reflect how much better or worse the output is than the group average. The clipping range itself stays the same fixed range as in PPO.
As a result, outputs far from the group average receive larger updates and outputs near the average receive smaller ones, which is intended to improve training stability and efficiency.
GRPO Surrogate Loss & KL penalty
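- Length-normalized reward: $\tilde{r}_i = r_i / |o_i|$, where $r_i$ is the reward of output $o_i$ and $|o_i|$ is its token count
- Mean and std of normalized rewards: $\mu = \frac{1}{G}\sum_{i=1}^{G} \tilde{r}_i$, $\sigma = \sqrt{\frac{1}{G}\sum_{i=1}^{G} (\tilde{r}_i - \mu)^2}$, where $G$ is the group size
- Std-normalized advantage: $\hat{A}_i = (\tilde{r}_i - \mu) / \sigma$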
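The three steps in the list above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name, the optional `lengths` argument, and the small `eps` guard against a zero standard deviation are assumptions made for this example.

```python
import statistics

def group_relative_advantages(rewards, lengths=None, eps=1e-8):
    """Std-normalized advantages for one group of responses to the same prompt.

    rewards: one scalar reward per response in the group
    lengths: optional token counts; if given, each reward is first divided
             by its response length (the "length-normalized reward" step,
             as this sketch interprets it)
    eps:     assumed guard so a group with identical rewards does not divide by 0
    """
    if lengths is not None:
        rewards = [r / n for r, n in zip(rewards, lengths)]
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to one question: two correct (1.0), two incorrect (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers end up with a positive advantage and incorrect ones with a negative advantage, and the advantages within a group sum to roughly zero, which is what lets the group mean act as the baseline in place of a critic.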
The KL constraint keeps the model from straying too far from the original (reference) model and becoming over-optimized to the reward model. Most RLHF-style RL methods for language models include such a term to limit reward hacking.
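One way to compute such a penalty per token is the non-negative estimator commonly used in GRPO implementations, `exp(d) - d - 1` with `d = log p_ref - log p_policy`. The sketch below assumes that form; the function name and the list-of-floats interface are hypothetical choices for illustration.

```python
import math

def kl_penalty_per_token(logp_policy, logp_ref):
    """Per-token KL estimate between the current policy and a reference model.

    logp_policy: log-probabilities the current policy assigns to the sampled tokens
    logp_ref:    log-probabilities the frozen reference model assigns to the same tokens
    Returns exp(d) - d - 1 for d = logp_ref - logp_policy, which is >= 0
    and equals 0 exactly when the two models agree on a token.
    """
    penalties = []
    for lp, lr in zip(logp_policy, logp_ref):
        d = lr - lp
        penalties.append(math.exp(d) - d - 1.0)
    return penalties

# The penalty grows as the policy drifts away from the reference on a token.
penalties = kl_penalty_per_token([-1.0, -1.0, -0.2], [-1.0, -2.0, -3.0])
```

In training, the mean of these values would be scaled by a coefficient and subtracted from the objective; the coefficient is a tuning choice and is not specified here.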
Normalization terms
- Standard deviation normalization - Rescales the update signal to unit variance to improve training stability. However, in a group with a wide reward spread, the same reward gap is flattened more because it is divided by a larger standard deviation
- Length normalization - Divides by the token count of each response to equalize the policy-gradient impact of answers of different lengths. Although it plays a role similar to a discount factor, it over-reinforces short correct answers and is relatively lenient on long incorrect ones
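A small numeric sketch may make both effects concrete. The reward values, token counts, and helper names below are made up for illustration only.

```python
import statistics

def std_normalized(rewards, eps=1e-8):
    """Center rewards on the group mean and divide by the group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def per_token_weight(advantage, num_tokens):
    """Gradient weight each token receives when the response-level
    advantage is divided by the response length."""
    return advantage / num_tokens

# Std normalization: the response with reward 1.0 sits 0.5 above the mean in
# both groups, but the wide-spread group shrinks its advantage much more.
narrow = std_normalized([0.0, 1.0, 0.0, 1.0])
wide = std_normalized([0.0, 1.0, 10.0, -9.0])

# Length normalization: same advantage magnitude, different response lengths.
short_correct = per_token_weight(+1.0, 50)
long_correct = per_token_weight(+1.0, 500)
short_wrong = per_token_weight(-1.0, 50)
long_wrong = per_token_weight(-1.0, 500)
```

Here `narrow[1]` is far larger than `wide[1]` even though both responses beat their group mean by the same margin, and `long_wrong` is ten times smaller in magnitude than `short_wrong`, so each token of a long incorrect answer is penalized less than each token of a short one.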
Conclusion
Through grouping, GRPO uses the group average as its baseline, so computing the advantage no longer requires a learned value function. It is still PPO-like and on-policy, and it keeps the clipped surrogate objective; when each batch is used for a single update the probability ratio is 1, and the objective reduces to a simple policy-gradient objective with a group-average baseline. The gain is flexibility: there is no critic network to train. The key result is that applying GRPO solely to math problems improved general chain-of-thought reasoning overall.
- Removing critic network
- Group-relative advantage in place of GAE
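To relate the group baseline to the PPO-style objective, here is a minimal per-token sketch assuming the standard clipped form with a fixed range `clip_eps` (the name and the default of 0.2 are assumptions for this example). With a probability ratio of exactly 1, the term reduces to the advantage itself, which is the plain policy-gradient case.

```python
def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped term for one token.

    ratio:     pi_new(token) / pi_old(token)
    advantage: group-relative advantage of the response the token belongs to
    clip_eps:  assumed clipping range; the ratio is limited to [1 - eps, 1 + eps]
    """
    unclipped = ratio * advantage
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    clipped = clipped_ratio * advantage
    # Taking the minimum removes any incentive to push the ratio outside the range.
    return min(unclipped, clipped)

on_policy = clipped_surrogate(1.0, 2.0)     # ratio 1: equals the advantage
pushed_up = clipped_surrogate(1.5, 2.0)     # positive advantage, ratio capped at 1 + eps
pushed_down = clipped_surrogate(0.5, -1.0)  # negative advantage, ratio floored at 1 - eps
```

The training loss would be the negative mean of this term over tokens, minus any KL penalty.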
Limitation
GRPO's length normalization biases training toward short correct answers while being lenient on long incorrect ones (length bias), and its std normalization biases training toward samples of extreme difficulty (difficulty bias). Dr. GRPO removes both normalization terms to address these biases, improving token efficiency.
In fact, GRPO's objective mainly makes the model robust to decoding variation across samples of the same prompt, since the group is produced by sampling at a nonzero generation temperature; it does not directly reward the reasoning process itself. Training on group-based verifiable rewards to achieve this temperature robustness has improved general chain-of-thought performance, but further verification is needed to confirm whether reasoning itself improved.
Appendix is awesome for Language Model RL
Implementation
Dr. GRPO without normalization
8x A100 GPUs for 27 hours → 7B model achieves 43.3% on AIME 2024 (Zero-RL SOTA)
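- Mean reward: $\bar{r} = \frac{1}{G}\sum_{j=1}^{G} r_j$
- Raw advantage: $A_i = r_i - \bar{r}$ (simplified) or $A_i = r_i - \frac{1}{G-1}\sum_{j \neq i} r_j$ (expectation 0 for small batch due to the leave-one-out baseline)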
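Both advantage variants named in the list above can be sketched without any normalization terms. The function names are hypothetical, and the leave-one-out form is written under the assumption that each response's baseline is the mean reward of the other responses in its group.

```python
def centered_advantages(rewards):
    """Simplified form: reward minus the group mean, with no std division."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def leave_one_out_advantages(rewards):
    """Leave-one-out form: reward minus the mean of the *other* responses."""
    g = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (g - 1) for r in rewards]

rewards = [1.0, 0.0, 0.0, 1.0]
simple = centered_advantages(rewards)
loo = leave_one_out_advantages(rewards)
```

The two differ only by a constant factor of `G / (G - 1)`, so for a fixed group size the choice mostly rescales the effective learning rate.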
- Group size: 8
- Learning rate: 1e-6
- Without KL term ($\beta = 0$)
- Temperature 1
DeGRPO (Decoupled Group Relative Policy Optimization) for cost optimization with short answers