While scaling laws (Kaplan et al., 2020) already exist for pre-training, the RL stage (e.g., PPO, DAPO, GRPO) has been unstable and unpredictable in how performance (reward gain) improves with compute. To predict RL performance, the authors propose a mathematical model: the sigmoidal compute–performance curve.
Sigmoidal Compute–Performance Curve
Unlike pretraining's power law, RL performance has an empirically demonstrated upper bound (saturation point).
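A sigmoid consistent with the parameter definitions below (a reconstruction from those definitions; the paper's exact notation may differ, and R_0, the pre-RL starting performance, is an assumed extra symbol):

$$
R_C - R_0 = (A - R_0) \cdot \frac{1}{1 + \left(\frac{C_{\text{mid}}}{C}\right)^{B}}
$$

Here C is the RL training compute and R_C the expected reward at that compute; at C = C_mid half of the gap between R_0 and A has been recovered, and as C → ∞ performance saturates at A.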
- A (Asymptotic Reward): maximum performance limit
- B (Scaling Exponent): efficiency, i.e., how quickly performance increases with compute
- C_mid: compute required to reach the halfway point toward the maximum performance
This sigmoid captures the saturation phase of RL better than the existing pretraining scaling law, which follows a power-law (C^α) form.
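As a concrete illustration, here is a minimal sketch of fitting this curve with SciPy; the (compute, reward) data points, bounds, and initial guesses are hypothetical, and R0 is the assumed pre-RL baseline from above.

```python
# Minimal sketch (illustrative, not from the paper): fit the sigmoidal
# compute-performance curve to hypothetical (compute, reward) pairs.
import numpy as np
from scipy.optimize import curve_fit

def sigmoidal_reward(C, A, B, C_mid, R0):
    """Predicted reward at RL compute C; saturates at A as C grows."""
    return R0 + (A - R0) / (1.0 + (C_mid / C) ** B)

# Hypothetical observations: RL compute (e.g., GPU-hours) vs. mean reward
compute = np.array([1e2, 3e2, 1e3, 3e3, 1e4, 3e4, 1e5])
reward = np.array([0.32, 0.38, 0.47, 0.55, 0.60, 0.62, 0.63])

# Fit A (ceiling), B (efficiency), C_mid (half-way compute), R0 (baseline)
params, _ = curve_fit(
    sigmoidal_reward, compute, reward,
    p0=[0.65, 1.0, 1e3, 0.3],
    bounds=([0.0, 0.1, 1e1, 0.0], [1.0, 5.0, 1e6, 1.0]),
)
A, B, C_mid, R0 = params
print(f"A={A:.3f}  B={B:.2f}  C_mid={C_mid:.0f}  R0={R0:.3f}")

# Extrapolate the fitted curve to 10x more compute than observed
print(f"predicted reward at C=1e6: {sigmoidal_reward(1e6, *params):.3f}")
```

The point of such a fit is extrapolation: estimate A and B from early, cheaper runs and compare recipes without training each one to saturation.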
Optimal Configuration Recipe
- PipelineRL with 8-step off-policy training (Off-policy Replay Buffer)
- CISPO loss function (truncated importance-sampling REINFORCE; see the sketch after this list)
- Prompt-level loss aggregation (aggregate token losses to normalize for sequence length)
- Batch-level advantage normalization (normalize advantages across the whole batch rather than per prompt group)
- FP32 precision at LM logits
- Zero-variance filtering & no-positive-resampling data curriculum (drop prompts whose rollouts all receive the same reward, and do not re-sample prompts that are already reliably solved)
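A minimal PyTorch-style sketch of how several of these pieces could fit together in one loss function: CISPO-style truncated importance-sampling REINFORCE, prompt-level loss aggregation, batch-level advantage normalization, and zero-variance filtering. Tensor shapes, the clipping threshold, and the group-relative baseline are illustrative assumptions, not the reference implementation.

```python
import torch

def rl_loss(logp_new, logp_old, rewards, mask, clip_max=4.0, eps=1e-8):
    """
    logp_new: [P, G, T] token log-probs under the current policy
    logp_old: [P, G, T] token log-probs under the behavior (generation) policy
    rewards:  [P, G]    scalar reward per sampled response (G responses per prompt)
    mask:     [P, G, T] 1 for response tokens, 0 for padding
    """
    # Zero-variance filtering: drop prompts whose G rewards are all identical
    # (all-correct or all-wrong), since they carry zero advantage signal.
    keep = rewards.std(dim=1) > eps                        # [P]
    rewards, mask = rewards[keep], mask[keep]
    logp_new, logp_old = logp_new[keep], logp_old[keep]

    # Group-relative advantage, then batch-level normalization: one mean/std
    # over the whole batch rather than per prompt group.
    adv = rewards - rewards.mean(dim=1, keepdim=True)      # [P, G]
    adv = (adv - adv.mean()) / (adv.std() + eps)

    # CISPO-style truncated importance sampling: clip and detach the IS ratio
    # so gradients flow only through the current-policy log-probs (REINFORCE).
    ratio = torch.exp(logp_new - logp_old).clamp(max=clip_max).detach()
    token_loss = -ratio * adv.unsqueeze(-1) * logp_new     # [P, G, T]

    # Prompt-level aggregation: average over all tokens belonging to a prompt
    # so long responses do not dominate the update.
    tokens_per_prompt = mask.sum(dim=(1, 2)).clamp(min=1)
    per_prompt = (token_loss * mask).sum(dim=(1, 2)) / tokens_per_prompt
    return per_prompt.mean()
```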

Seonglae Cho