CISPO loss

Creator
Seonglae Cho
Created
2025 Oct 19 23:21
Editor
Edited
2025 Oct 19 23:27
Refs

Clipped Importance-Sampling Policy Optimization

Works without a value function: similar to
PPO
, but instead of clipping the surrogate objective it truncates the
Importance sampling
ratio and applies a stop-gradient to it, which stabilizes training and lets off-policy samples be reused. Like
GRPO
, it computes a batch-level (group-relative) advantage, eliminating the need for a value function.
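The GRPO-style batch-level advantage mentioned above can be sketched as follows. This is a minimal sketch, not the exact implementation; the function name and the small `eps` added for numerical stability are my assumptions:

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each reward against the group
    of rollouts sampled for the same prompt, instead of using a critic."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```

With binary rewards such as `[1, 0, 1, 0]`, correct rollouts get advantage ≈ +1 and incorrect ones ≈ −1, and the advantages sum to zero, so no learned value baseline is needed.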
In PPO, the surrogate objective itself is clipped (loss-level clipping, i.e. "gradient suppression": tokens whose ratio falls outside the clip range contribute no gradient), whereas CISPO clips only the IS weight and detaches it. The gradient therefore always flows through log π_θ, so every token keeps its learning signal and gradient variance is preserved. The clipping is a truncation of the probability ratio alone, not of the objective.
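A per-token version of this loss can be sketched as below. It is illustrative only: the ε values are placeholders (the paper effectively relaxes the lower bound), and since numpy has no autograd the stop-gradient is indicated by a comment rather than a `detach()` call:

```python
import numpy as np

def cispo_token_loss(logp_new, logp_old, advantage,
                     eps_high=0.28, eps_low=1.0):
    """CISPO token loss: -sg(clip(r)) * A * log pi_theta.

    r is the importance-sampling ratio pi_theta / pi_old. Only the
    ratio is truncated; the clipped weight is treated as a constant
    (stop-gradient), so the gradient always flows through logp_new.
    """
    r = np.exp(logp_new - logp_old)            # IS ratio
    r_trunc = np.clip(r, 1.0 - eps_low, 1.0 + eps_high)  # sg() in autograd
    return -r_trunc * advantage * logp_new
```

The gradient of this loss w.r.t. log π_θ is −clip(r)·A for every token. In PPO the analogous gradient is exactly zero once the ratio is clipped, which is the signal loss CISPO avoids.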