ScaleRL

While scaling laws (Kaplan et al., 2020) already exist for pre-training, the RL stage (e.g., PPO, DAPO, GRPO) has been unstable and unpredictable in how much performance improvement (reward gain) it delivers per unit of compute. To make RL performance predictable, the paper proposes a mathematical model: the sigmoidal compute-performance curve.

Sigmoidal Compute–Performance Curve

Unlike pretraining's power law, RL has an upper bound (saturation point), which the paper demonstrates empirically. Expected reward $R_C$ as a function of RL training compute $C$ is modeled as a sigmoid (symbols follow the parameter descriptions below; $R_0$ is the reward of the initial policy before RL):

$$R_C = R_0 + \frac{A - R_0}{1 + \left(C_{\text{mid}} / C\right)^{B}}$$

  • $A$ (Asymptotic Reward): maximum performance limit
  • $B$ (Scaling Exponent): efficiency, i.e. how quickly performance approaches the asymptote
  • $C_{\text{mid}}$: compute required to reach half of the maximum reward gain
This captures the saturation phase of RL better than the existing pretraining scaling law, which follows a power-law ($C^{\alpha}$) form.
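Below is a minimal sketch of how such a curve could be fit in practice with scipy.optimize.curve_fit; the baseline reward R0, the synthetic compute/reward points, and the initial guesses are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

R0 = 0.30  # assumed pass rate of the initial policy before RL (illustrative)

def sigmoid_scaling(C, A, B, C_mid):
    """Predicted reward at RL training compute C (same units as C_mid)."""
    return R0 + (A - R0) / (1.0 + (C_mid / C) ** B)

# Synthetic compute/reward measurements for illustration only.
compute = np.array([1e21, 3e21, 1e22, 3e22, 1e23])   # e.g. training FLOPs
reward  = np.array([0.38, 0.46, 0.55, 0.61, 0.64])   # mean eval pass rate

# Fit A (asymptote), B (scaling exponent), C_mid (compute at half of max gain).
(A, B, C_mid), _ = curve_fit(sigmoid_scaling, compute, reward,
                             p0=[0.7, 1.0, 1e22], maxfev=20_000)
print(f"A={A:.3f}  B={B:.2f}  C_mid={C_mid:.2e}")

# Extrapolate the fitted curve to a larger compute budget.
print("predicted reward at 1e24:", round(sigmoid_scaling(1e24, A, B, C_mid), 3))
```

Because the curve saturates at $A$, the fit also indicates when additional RL compute stops paying off.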

Optimal Configuration Recipe

  • Prompt-level loss aggregation (aggregate token losses per prompt to normalize for sequence length; see the sketch after this list)
  • Batch-level advantage normalization (normalize advantages with batch-level statistics rather than per prompt)
  • FP32 precision at LM logits (compute the final logits in float32 for numerical stability)
  • Zero-variance filtering & No-Positive-Resampling data curriculum (drop prompts whose rollouts all receive the same reward, and avoid resampling prompts that are already solved)
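The following is a minimal PyTorch sketch, with assumed tensor shapes and helper names rather than the paper's implementation, of three of these components: prompt-level loss aggregation, batch-level advantage normalization, and zero-variance filtering (the FP32-logits item is just a float32 cast at the LM head, noted in a comment).

```python
import torch

def prompt_level_loss(token_loss: torch.Tensor, token_mask: torch.Tensor) -> torch.Tensor:
    """token_loss, token_mask: [num_prompts, rollouts, seq_len].
    Average token losses over all tokens belonging to one prompt's rollouts
    first (so long responses do not dominate), then average over prompts."""
    per_prompt = (token_loss * token_mask).sum(dim=(1, 2)) / token_mask.sum(dim=(1, 2)).clamp(min=1)
    return per_prompt.mean()

def batch_level_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: [num_prompts, rollouts] scalar rewards per rollout.
    Subtract the per-prompt mean as a baseline, then normalize by the
    standard deviation of the whole batch instead of per prompt."""
    centered = rewards - rewards.mean(dim=1, keepdim=True)
    return centered / (centered.std() + 1e-6)

def zero_variance_mask(rewards: torch.Tensor) -> torch.Tensor:
    """Keep only prompts whose rollouts disagree: if every rollout got the
    same reward (all solved or all failed), advantages are zero and the
    prompt contributes no learning signal."""
    return rewards.std(dim=1) > 0

# FP32 at the LM head (inside the model's forward pass), e.g.:
#   logits = lm_head(hidden_states.float())
```

With, say, 16 rollouts per prompt, `zero_variance_mask(rewards)` would be applied before computing advantages so that all-correct or all-wrong prompts are dropped from the batch.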

Recommendations