GRPO

Creator: Seonglae Cho
Created: 2025 Jan 26 19:49
Edited: 2025 Feb 19 21:06
 
 

Group-relative policy update

Instead of learning a separate value model as in PPO, GRPO samples multiple outputs for the same question and treats them as a group, computing each sample's relative performance (advantage) from the reward distribution within that group. Specifically, the reward model scores every output in the group, and each output's advantage is its reward normalized by the group's mean and standard deviation, so no value model is needed.
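Assuming the outcome-reward formulation from the DeepSeekMath paper, the group-relative advantage of output $o_i$ with reward $r_i$ in a group of $G$ samples is:

$$
\hat{A}_i = \frac{r_i - \mathrm{mean}\left(\{r_1, \dots, r_G\}\right)}{\mathrm{std}\left(\{r_1, \dots, r_G\}\right)}
$$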
This group-relative advantage replaces the GAE estimate that PPO would otherwise compute from a value model: it is plugged into the PPO-style clipped surrogate objective, so the update reflects how much better or worse each output is compared to the group average.
As a result, learning stability and efficiency improve: outputs that clearly outperform the group receive larger updates, while outputs near the group average receive smaller ones.
In other words, it is still a PPO-like on-policy method, but with more flexibility.
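The sketch below (PyTorch; the function name grpo_loss and its argument layout are illustrative, not from the source) shows how the group-normalized advantage slots into the PPO-style clipped surrogate. The full GRPO objective also adds a KL penalty toward a reference policy, omitted here for brevity.

```python
import torch

def grpo_loss(logprobs, old_logprobs, rewards, group_size, clip_eps=0.2):
    """Minimal sketch of a GRPO-style loss for one batch of grouped samples.

    logprobs / old_logprobs: (num_groups * group_size,) sequence-level log-probs
        of each sampled output under the current / sampling policy.
    rewards: (num_groups * group_size,) scalar reward per output.
    group_size: number of outputs sampled per question (must divide the batch).
    """
    # Group-relative advantage: normalize rewards within each question's group
    r = rewards.view(-1, group_size)
    adv = (r - r.mean(dim=1, keepdim=True)) / (r.std(dim=1, keepdim=True) + 1e-8)
    adv = adv.view(-1)

    # PPO-style clipped surrogate on the importance ratio (clip range stays fixed)
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()
```

In this sketch the advantage comes entirely from within-group reward statistics, which is what removes the need for a learned value model while keeping the familiar clipped-ratio update.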
 
 
 
 
 
 
