Clipped Importance-Sampling Policy Optimization (CISPO)
A PPO-like objective that works without a value function: instead of PPO's objective-level clipping, CISPO truncates the importance-sampling (IS) ratio itself and applies a stop-gradient to the truncated weight, which stabilizes training while still allowing off-policy samples to be reused. Like GRPO, it uses batch-level (group-relative) advantages, so no value function is needed.
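For reference, a sketch of the CISPO objective as presented in the MiniMax-M1 paper (notation assumed here: G sampled responses o_i per prompt q, Â_{i,t} the group-normalized advantage, sg the stop-gradient operator):

```latex
J_{\text{CISPO}}(\theta)
  = \mathbb{E}\!\left[
      \frac{1}{\sum_{i=1}^{G} |o_i|}
      \sum_{i=1}^{G} \sum_{t=1}^{|o_i|}
      \operatorname{sg}\!\big(\hat{r}_{i,t}(\theta)\big)\,
      \hat{A}_{i,t}\,
      \log \pi_\theta\!\big(o_{i,t} \mid q,\, o_{i,<t}\big)
    \right],
\quad
\hat{r}_{i,t}(\theta)
  = \operatorname{clip}\!\Big(r_{i,t}(\theta),\; 1-\varepsilon^{IS}_{\text{low}},\; 1+\varepsilon^{IS}_{\text{high}}\Big)
```

The clip bounds only the coefficient sg(r̂); because sg removes it from the gradient path, the policy gradient flows through the log-probability term for every token.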
In PPO, the clipping acts on the objective itself (loss-level clipping, a form of gradient suppression): tokens whose ratio falls outside the clip range contribute zero gradient. CISPO instead bounds only the IS weight, so the gradient always passes through log π_θ, maintaining the learning signal for every token while the truncated, stop-gradiented weight keeps the variance bounded. In other words, it is truncation that clips only the probability-ratio coefficient, not the loss.
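To make the contrast concrete, a minimal PyTorch sketch of a CISPO-style token loss next to the standard PPO surrogate. Tensor shapes ([batch, seq]), padding handling, and the epsilon values are illustrative assumptions, not the paper's settings:

```python
import torch

def cispo_loss(logp_new, logp_old, advantages, eps_high=2.0, eps_low=1000.0):
    """CISPO-style loss: clip the IS weight itself, then stop its gradient.

    A large eps_low effectively disables the lower clip (ratio >= 0),
    leaving a one-sided truncation; both bounds are shown for generality.
    """
    ratio = torch.exp(logp_new - logp_old)                 # IS ratio r_t
    # sg(clip(r_t)): truncated weight, detached from the gradient graph
    weight = torch.clamp(ratio, 1 - eps_low, 1 + eps_high).detach()
    # Gradient flows through logp_new for EVERY token, scaled by the
    # bounded weight, so no token's learning signal is zeroed out.
    return -(weight * advantages * logp_new).mean()

def ppo_loss(logp_new, logp_old, advantages, eps=0.2):
    """Standard PPO clipped surrogate for contrast: clipping acts on the
    objective, so tokens in the clipped region receive zero gradient."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```

In the PPO version, once the ratio is clipped the min() selects a constant branch and the token's gradient vanishes; in the CISPO version the clamp only caps the coefficient, which is the "truncation that only clips the probability ratio" described above.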
Seonglae Cho