Clipped Importance-Sampling Policy Optimization
CISPO operates without a value function: like GRPO, it uses group/batch-level advantages, which eliminates the need for a learned critic. Unlike PPO, it stabilizes training by truncating (clipping) the importance-sampling (IS) ratio itself under a stop-gradient, which also allows off-policy samples to be reused.
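A sketch of the objective, written in the GRPO-style notation of the linked paper (the exact symbols, e.g. the two-sided clipping bounds ε_low/ε_high, are as I recall them from the paper and should be checked against it):

$$
J_{\text{CISPO}}(\theta) = \mathbb{E}\!\left[ \frac{1}{\sum_i |o_i|} \sum_{i=1}^{G} \sum_{t=1}^{|o_i|} \operatorname{sg}\!\big(\hat{r}_{i,t}(\theta)\big)\, \hat{A}_i \, \log \pi_\theta\big(o_{i,t} \mid q, o_{i,<t}\big) \right],
\qquad
\hat{r}_{i,t} = \operatorname{clip}\!\big(r_{i,t},\, 1-\varepsilon_{\text{low}},\, 1+\varepsilon_{\text{high}}\big)
$$

Here sg(·) is the stop-gradient operator, r is the token-level IS ratio π_θ/π_old, and Â_i is the group-normalized advantage shared by all tokens of response o_i.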
In PPO, the objective itself is clipped (loss-level clipping), which suppresses the gradient of clipped tokens entirely. CISPO instead clips only the IS weight, and because the clipped weight sits inside a stop-gradient, the gradient always passes through log π_θ: every token keeps its learning signal, while the truncated ratio keeps the update variance bounded. In other words, it is a truncation applied to the probability ratio alone.
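A minimal pure-Python sketch of the difference in per-token gradient behavior. `cispo_coeff` is the coefficient multiplying the gradient of log π_θ under CISPO (the stop-gradient makes the clipped weight a constant), and `ppo_keeps_gradient` reproduces when PPO's min/clip objective passes a gradient at all. The function names and the ε = 0.2 default are illustrative, not from the paper.

```python
def cispo_coeff(ratio, adv, eps_low=0.2, eps_high=0.2):
    # Clipped IS weight times advantage. Because the weight is under a
    # stop-gradient in the real objective, this value IS the per-token
    # gradient coefficient on log pi_theta: nonzero whenever adv != 0.
    return max(1.0 - eps_low, min(1.0 + eps_high, ratio)) * adv

def ppo_keeps_gradient(ratio, adv, eps=0.2):
    # PPO's clipped surrogate zeroes the token gradient when the ratio
    # is already outside the trust region on the side the update favors.
    return not ((ratio > 1.0 + eps and adv > 0) or
                (ratio < 1.0 - eps and adv < 0))

for r in (0.5, 1.0, 1.5):
    print(f"ratio={r}: cispo grad coeff={cispo_coeff(r, 1.0):.1f}, "
          f"ppo passes gradient={ppo_keeps_gradient(r, 1.0)}")
```

With a positive advantage and ratio 1.5, PPO drops the token's gradient entirely, while CISPO still updates it with the truncated weight 1.2.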
https://arxiv.org/pdf/2506.13585v1

Seonglae Cho