GRPO

Creator: Seonglae Cho
Created: 2025 Jan 26 19:49
Edited: 2025 Jul 10 20:29

Group Relative Policy Optimization

In conventional PPO, a separate critic network (a value-function estimator) is used to compute the advantage. GRPO removes this critic: multiple responses to the same question are grouped, and intra-group reward statistics are used to compute a relative advantage.

Multi policy-group based update

Instead of using PPO's fixed clipping range, multiple outputs sampled for the same question are grouped, and each sample's relative performance (advantage) is computed from the reward distribution within that group. Concretely, GRPO normalizes each sample's reward against the group's reward statistics from the reward model, with no value model. Using this relative advantage, dynamic clipping is applied that reflects how much better or worse each output is than the group average, instead of relying on GAE. As a result, learning stability and efficiency improve: well-performing outputs receive larger updates while outputs near the group average receive smaller ones.
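As a minimal sketch (assuming one scalar reward per sampled output; the function name is illustrative), the group-relative advantage can be computed as:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantage for one prompt.

    rewards: shape (G,), one scalar reward per sampled output in the group.
    Returns shape (G,): each reward standardized by the group mean and std,
    later broadcast to every token of the corresponding output.
    """
    mean = rewards.mean()
    std = rewards.std(unbiased=False)
    return (rewards - mean) / (std + eps)

# Example: 4 sampled answers to the same question with binary correctness rewards
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
print(group_relative_advantages(rewards))  # positive for correct, negative for incorrect
```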

Reference Policy

Preventing AI Reward Hacking and preserving the inherent quality of the language model requires a trust region, which is why a KL Divergence term is added to prevent large policy changes. Here, the reference policy is the original LLM. The old policy is the policy from the step just before the current iteration, under which rollouts are performed to compute success rates and importance weights. All RLHF-like language-model RL methods include this term to prevent AI Reward Hacking.
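Implementations in this family typically estimate the per-token KL to the reference policy with a low-variance estimator rather than computing it exactly; a minimal sketch (tensor names are assumptions):

```python
import torch

def kl_to_reference(logp_theta: torch.Tensor, logp_ref: torch.Tensor) -> torch.Tensor:
    """Per-token estimate of KL(pi_theta || pi_ref).

    logp_theta, logp_ref: log-probabilities of the sampled tokens under the
    current policy and the frozen reference policy, same shape.
    Uses the non-negative estimator r - log r - 1 with r = pi_ref / pi_theta,
    whose expectation under pi_theta is the KL divergence.
    """
    log_ratio = logp_ref - logp_theta
    return torch.exp(log_ratio) - log_ratio - 1.0
```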

GRPO Surrogate Loss & KL penalty

  • Length-normalized reward: $r_i^{\text{norm}} = \frac{r_i}{|o_i|}$
  • Group mean and standard deviation of normalized rewards: $\mu = \frac{1}{N}\sum_{j=1}^N r_j^{\text{norm}}$, $\sigma = \sqrt{\frac{1}{N}\sum_{j=1}^N \bigl(r_j^{\text{norm}} - \mu\bigr)^2}$
  • Std-normalized advantage: $A_i = \frac{r_i^{\text{norm}} - \mu}{\sigma}$
$$\mathcal{L}_{\mathrm{GRPO}}(\theta) = \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min\!\Bigl[ \frac{\pi_{\theta}(o_{i,t}\mid q, o_{i,<t})}{\pi_{\mathrm{old}}(o_{i,t}\mid q, o_{i,<t})}\,\hat A_{i,t},\; g\bigl(\epsilon,\hat A_{i,t}\bigr) \Bigr] - \beta\,D_{\mathrm{KL}}\bigl[\pi_{\theta}\,\|\,\pi_{\mathrm{ref}}\bigr]$$
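A hedged PyTorch sketch of this objective (sign flipped so it can be minimized; tensor names, shapes, and the default β are assumptions rather than the reference implementation):

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, advantages, resp_mask,
              eps_clip: float = 0.2, beta: float = 0.04) -> torch.Tensor:
    """Clipped surrogate + KL penalty, averaged per response, then over the group.

    logp_new / logp_old / logp_ref: (G, T) log-probs of sampled tokens under the
        current, rollout (old), and reference policies.
    advantages: (G,) group-standardized rewards, broadcast over tokens.
    resp_mask: (G, T) 1 for response tokens, 0 for padding.
    """
    adv = advantages.unsqueeze(-1)                       # (G, 1), broadcast over tokens
    ratio = torch.exp(logp_new - logp_old)               # per-token importance ratio
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - eps_clip, 1 + eps_clip) * adv)

    # Per-token estimate of KL(pi_theta || pi_ref)
    log_ratio_ref = logp_ref - logp_new
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1.0

    per_token = surrogate - beta * kl
    per_response = (per_token * resp_mask).sum(-1) / resp_mask.sum(-1)  # 1/|o_i| * sum_t
    return -per_response.mean()                          # maximize objective -> minimize loss
```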

Normalization terms

  • Standard deviation normalization - scales the update signal to unit variance for "training stability", but when a group contains large rewards the signal is divided by a larger standard deviation, so the same reward is flattened more
  • Length normalization - divides by the token count of each response so that answers of different lengths have an equal policy-gradient impact. While it plays a role similar to a Discount factor, it over-reinforces "short and accurate" answers while being relatively lenient on "long and incorrect" ones
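As a concrete illustration of the flattening effect of std normalization, two groups with very different reward gaps yield the same standardized advantage for their best sample:

$$\{2,1,1,1\}:\ \mu = 1.25,\ \sigma \approx 0.43,\ A_1 \approx 1.73 \qquad\qquad \{9,1,1,1\}:\ \mu = 3,\ \sigma \approx 3.46,\ A_1 \approx 1.73$$

The much larger reward margin in the second group is divided by a proportionally larger standard deviation, so it produces no stronger update signal.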

Conclusion

Through grouping, instead of relying on PPO's critic-based "clipped surrogate objective", GRPO directly optimizes a simple policy-gradient objective with a group-average baseline. In other words, while it is still PPO-like and On-policy, it is more flexible and removes the dependency on a critic network (no value function needs to be learned to use the advantage). The key achievement is that applying GRPO solely to math problems yielded an overall improvement in general chain-of-thought reasoning.
  • Removing critic network
  • Dynamic clipping

Limitation

GRPO's length normalization biases training toward short correct answers while being lenient on long incorrect answers (length bias), and its std normalization biases updates toward samples of extreme difficulty (difficulty bias). Dr. GRPO removes both normalization terms to address these biases, improving token efficiency.
Vanilla GRPO has bias to increase reasoning length with wrong answers
In fact, GRPO is an objective designed to make decoding variations robust within the same inference by relying on Text Generation Temperature, rather than directly providing an AI Incentive for AI Reasoning itself. While training on group-level Verifiable Reward to achieve temperature robustness has improved general chain-of-thought reasoning performance, further verification is needed to confirm whether it actually improved reasoning itself.

Implementation

Dr. GRPO without normalization

  • Mean reward: $\bar r = \frac{1}{N}\sum_{j=1}^N r_j$
  • Raw advantage: $A_i = r_i - \bar r$ (simplified), or $A_i = \frac{N}{N-1}(r_i - \bar r)$ with the leave-one-out baseline (keeps the expectation at 0 for small batches)
$$\mathcal{L}_{\mathrm{Dr.\,GRPO}}(\theta) = \frac{1}{G} \sum_{i=1}^{G} \sum_{t=1}^{|o_i|} \min\!\Bigl[ \frac{\pi_{\theta}(o_{i,t}\mid q, o_{i,<t})}{\pi_{\mathrm{old}}(o_{i,t}\mid q, o_{i,<t})}\,\hat A_{i,t},\; g\bigl(\epsilon,\hat A_{i,t}\bigr) \Bigr] - \beta\,D_{\mathrm{KL}}\bigl[\pi_{\theta}\,\|\,\pi_{\mathrm{ref}}\bigr]$$
8x A100 GPUs for 27 hours → 7B model achieves 43.3% on AIME 2024 (Zero-RL SOTA)
  • Group size: 8
  • Learning rate: 1e-6
  • Without KL term ($\beta = 0$)
  • Temperature 1
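A minimal sketch of these changes relative to the GRPO loss above (no std or length normalization, optional leave-one-out baseline; names are illustrative):

```python
import torch

def dr_grpo_advantages(rewards: torch.Tensor, leave_one_out: bool = True) -> torch.Tensor:
    """Center rewards on the group mean without dividing by the std.

    With leave_one_out=True, each sample is compared against the mean of the
    other N-1 samples, which is equivalent to scaling by N/(N-1) and keeps the
    baseline unbiased for small groups.
    """
    n = rewards.numel()
    centered = rewards - rewards.mean()
    return centered * n / (n - 1) if leave_one_out else centered

def dr_grpo_loss(per_token_surrogate: torch.Tensor, resp_mask: torch.Tensor) -> torch.Tensor:
    """Sum over tokens (no 1/|o_i| factor), then average over the group."""
    return -(per_token_surrogate * resp_mask).sum(-1).mean()
```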
DeGRPO (Decoupled Group Relative Policy Optimization) for cost optimization with short answers

RL with Verifiable Reward

The loss combines binary verification rewards and KL regularization, and can be reformulated as a weighted contrastive loss over old-policy samples: good samples receive high scores while poor samples receive low scores. The optimal policy πₙ can be expressed explicitly in terms of the reference policy π₀, the previous policy πₙ₋₁, and the success-probability statistics pₙ₋₁, establishing a recursion pₙ = h(pₙ₋₁) that converges to a fixed point p*. It can further be proven that p* is always greater than the initial success probability p₀, demonstrating that GRPO effectively increases success probability through iterations. Additionally, for actual parametric policies (e.g., trained by gradient descent), if the statistical and approximation errors are small, the success probability is guaranteed to remain near p*.
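For intuition, a minimal sketch of a binary verifiable reward for math answers (the boxed-answer exact match is an assumption for illustration; real verifiers use more robust extraction and equivalence checks):

```python
import re

def verifiable_reward(response: str, ground_truth: str) -> float:
    """Return 1.0 if the final boxed answer matches the reference, else 0.0."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

# Example
print(verifiable_reward(r"... so the answer is \boxed{42}", "42"))  # 1.0
```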

GRPO++ (used to build DeepSWE)

Drawing on ideas from DAPO, Dr. GRPO, LOOP/RLOO, and others, the following improvements were added (two of them are sketched after this list):
  • Clip High (DAPO): increase the surrogate-loss upper bound to enhance exploration
  • No KL loss (DAPO)
  • No reward std or length normalization (Dr. GRPO)
  • Leave One Out (LOOP/RLOO): reduce variance by excluding the current sample when estimating the advantage baseline
  • Compact Filtering: mask trajectories that hit the max context, max steps, or a timeout
  • No Entropy Loss: remove the entropy loss to prevent instability
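A minimal sketch of two of these modifications, asymmetric clipping (Clip High) and compact filtering; the ε values and function names are illustrative assumptions:

```python
import torch

def clip_high_surrogate(ratio: torch.Tensor, adv: torch.Tensor,
                        eps_low: float = 0.2, eps_high: float = 0.28) -> torch.Tensor:
    """Asymmetric clipping: a looser upper bound lets low-probability tokens
    with positive advantage be reinforced more, encouraging exploration."""
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return torch.min(ratio * adv, clipped * adv)

def keep_trajectory(hit_max_context: bool, hit_max_steps: bool, timed_out: bool) -> bool:
    """Compact filtering: drop (mask out) trajectories that were cut off by the
    context limit, step limit, or a timeout instead of scoring them."""
    return not (hit_max_context or hit_max_steps or timed_out)
```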
 
 
 
