CISPO loss

Creator
Seonglae Cho
Created
2025 Oct 19 23:21
Editor
Edited
2025 Oct 19 23:27
Refs

Clipped Importance-Sampling Policy Optimization

Works without a value function: similar to
PPO
, but instead of clipping the surrogate objective it truncates the
Importance sampling
ratio and applies a stop-gradient to it, which stabilizes training and lets off-policy samples be reused. Like
GRPO
, it computes a batch-level (group-relative) advantage, eliminating the need for a value function.
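The GRPO-style batch-level advantage mentioned above can be sketched as follows. This is a minimal sketch, not the exact implementation; the function name and the small `eps` added for numerical stability are my assumptions:

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each reward against the group
    of rollouts sampled for the same prompt, instead of using a critic."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```

With binary rewards such as `[1, 0, 1, 0]`, correct rollouts get advantage ≈ +1 and incorrect ones ≈ −1, and the advantages sum to zero, so no learned value baseline is needed.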
In PPO, the surrogate objective itself is clipped (loss-level clipping, i.e. "gradient suppression": tokens whose ratio falls outside the clip range contribute no gradient), whereas CISPO clips only the IS weight and detaches it. The gradient therefore always flows through log π_θ, so every token keeps its learning signal and gradient variance is preserved. The clipping is a truncation of the probability ratio alone, not of the objective.
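A per-token version of this loss can be sketched as below. It is illustrative only: the ε values are placeholders (the paper effectively relaxes the lower bound), and since numpy has no autograd the stop-gradient is indicated by a comment rather than a `detach()` call:

```python
import numpy as np

def cispo_token_loss(logp_new, logp_old, advantage,
                     eps_high=0.28, eps_low=1.0):
    """CISPO token loss: -sg(clip(r)) * A * log pi_theta.

    r is the importance-sampling ratio pi_theta / pi_old. Only the
    ratio is truncated; the clipped weight is treated as a constant
    (stop-gradient), so the gradient always flows through logp_new.
    """
    r = np.exp(logp_new - logp_old)            # IS ratio
    r_trunc = np.clip(r, 1.0 - eps_low, 1.0 + eps_high)  # sg() in autograd
    return -r_trunc * advantage * logp_new
```

The gradient of this loss w.r.t. log π_θ is −clip(r)·A for every token. In PPO the analogous gradient is exactly zero once the ratio is clipped, which is the signal loss CISPO avoids.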