While scaling laws (Kaplan et al., 2020) already exist for pre-training, the RL stage (e.g., PPO, DAPO, GRPO) has been unstable and unpredictable in how performance (reward gain) improves with compute. To predict RL performance, the authors propose a mathematical model: the sigmoidal compute–performance curve.
Sigmoidal Compute–Performance Curve
Unlike pretraining's power law, RL performance has an empirically demonstrated upper bound (saturation point).
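A sigmoid consistent with the parameter definitions below (a reconstruction from those definitions; the paper's exact notation may differ, and R_0, the pre-RL starting performance, is an assumed extra symbol):

$$
R_C - R_0 = (A - R_0) \cdot \frac{1}{1 + \left(\frac{C_{\text{mid}}}{C}\right)^{B}}
$$

Here C is the RL training compute and R_C the expected reward at that compute; at C = C_mid half of the gap between R_0 and A has been recovered, and as C → ∞ performance saturates at A.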
- A (Asymptotic Reward): maximum performance limit
- B (Scaling Exponent): efficiency, i.e., how quickly performance increases with compute
- C_mid: compute required to reach the halfway point toward the maximum performance
This sigmoid captures the saturation phase of RL better than the existing pretraining scaling law, which follows a power-law (C^α) form.
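As a concrete illustration, here is a minimal sketch of fitting this curve with SciPy; the (compute, reward) data points, bounds, and initial guesses are hypothetical, and R0 is the assumed pre-RL baseline from above.

```python
# Minimal sketch (illustrative, not from the paper): fit the sigmoidal
# compute-performance curve to hypothetical (compute, reward) pairs.
import numpy as np
from scipy.optimize import curve_fit

def sigmoidal_reward(C, A, B, C_mid, R0):
    """Predicted reward at RL compute C; saturates at A as C grows."""
    return R0 + (A - R0) / (1.0 + (C_mid / C) ** B)

# Hypothetical observations: RL compute (e.g., GPU-hours) vs. mean reward
compute = np.array([1e2, 3e2, 1e3, 3e3, 1e4, 3e4, 1e5])
reward = np.array([0.32, 0.38, 0.47, 0.55, 0.60, 0.62, 0.63])

# Fit A (ceiling), B (efficiency), C_mid (half-way compute), R0 (baseline)
params, _ = curve_fit(
    sigmoidal_reward, compute, reward,
    p0=[0.65, 1.0, 1e3, 0.3],
    bounds=([0.0, 0.1, 1e1, 0.0], [1.0, 5.0, 1e6, 1.0]),
)
A, B, C_mid, R0 = params
print(f"A={A:.3f}  B={B:.2f}  C_mid={C_mid:.0f}  R0={R0:.3f}")

# Extrapolate the fitted curve to 10x more compute than observed
print(f"predicted reward at C=1e6: {sigmoidal_reward(1e6, *params):.3f}")
```

The point of such a fit is extrapolation: estimate A and B from early, cheaper runs and compare recipes without training each one to saturation.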
Optimal Configuration Recipe
- PipelineRL with 8-step off-policy training (Off-policy Replay Buffer)
- CISPO loss function (truncated importance-sampling REINFORCE; see the sketch after this list)
- Prompt-level loss aggregation (aggregate token losses to normalize for sequence length)
- Batch-level advantage normalization (normalize advantages across the whole batch rather than per prompt group)
- FP32 precision at LM logits
- Zero-variance filtering & no-positive-resampling data curriculum (drop prompts whose rollouts all receive the same reward, and do not re-sample prompts that are already reliably solved)
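A minimal PyTorch-style sketch of how several of these pieces could fit together in one loss function: CISPO-style truncated importance-sampling REINFORCE, prompt-level loss aggregation, batch-level advantage normalization, and zero-variance filtering. Tensor shapes, the clipping threshold, and the group-relative baseline are illustrative assumptions, not the reference implementation.

```python
import torch

def rl_loss(logp_new, logp_old, rewards, mask, clip_max=4.0, eps=1e-8):
    """
    logp_new: [P, G, T] token log-probs under the current policy
    logp_old: [P, G, T] token log-probs under the behavior (generation) policy
    rewards:  [P, G]    scalar reward per sampled response (G responses per prompt)
    mask:     [P, G, T] 1 for response tokens, 0 for padding
    """
    # Zero-variance filtering: drop prompts whose G rewards are all identical
    # (all-correct or all-wrong), since they carry zero advantage signal.
    keep = rewards.std(dim=1) > eps                        # [P]
    rewards, mask = rewards[keep], mask[keep]
    logp_new, logp_old = logp_new[keep], logp_old[keep]

    # Group-relative advantage, then batch-level normalization: one mean/std
    # over the whole batch rather than per prompt group.
    adv = rewards - rewards.mean(dim=1, keepdim=True)      # [P, G]
    adv = (adv - adv.mean()) / (adv.std() + eps)

    # CISPO-style truncated importance sampling: clip and detach the IS ratio
    # so gradients flow only through the current-policy log-probs (REINFORCE).
    ratio = torch.exp(logp_new - logp_old).clamp(max=clip_max).detach()
    token_loss = -ratio * adv.unsqueeze(-1) * logp_new     # [P, G, T]

    # Prompt-level aggregation: average over all tokens belonging to a prompt
    # so long responses do not dominate the update.
    tokens_per_prompt = mask.sum(dim=(1, 2)).clamp(min=1)
    per_prompt = (token_loss * mask).sum(dim=(1, 2)) / tokens_per_prompt
    return per_prompt.mean()
```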

Seonglae Cho