SAPO targets the hard-clipping problem of existing RL methods such as GRPO and GSPO. In LLM RL, training becomes unstable when the token-level importance ratio fluctuates widely, and existing methods handle this by clipping the ratio. The drawback of clipping is that even a few problematic tokens can cause an entire sequence's gradient to be discarded, and in MoE models, where ratio variance is especially large, this leads to instability. Instead of hard clipping, SAPO uses a smooth gating function: the further a token drifts off-policy, the more gradually its gradient is attenuated.
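A minimal sketch of the contrast, assuming a hypothetical bell-shaped sigmoid gate on the log importance ratio (the exact SAPO gate is in the blog post; the function below only illustrates the hard-vs-smooth difference):

```python
import math

def hard_clip_weight(ratio: float, eps: float = 0.2) -> float:
    # PPO/GRPO-style hard clipping: the gradient is zeroed entirely
    # once the importance ratio leaves [1 - eps, 1 + eps].
    return 1.0 if (1 - eps) <= ratio <= (1 + eps) else 0.0

def soft_gate_weight(ratio: float, tau: float = 2.0) -> float:
    # Hypothetical smooth gate: a sigmoid-based factor on the log-ratio
    # that peaks at 1 when ratio == 1 and decays gradually (never jumps
    # to zero) as the token becomes more off-policy. tau controls how
    # quickly off-policy tokens are attenuated.
    x = math.log(ratio)
    s = 1.0 / (1.0 + math.exp(-tau * x))
    return 4.0 * s * (1.0 - s)

# A token with ratio 1.3 is fully discarded by hard clipping (eps=0.2)
# but still contributes a reduced gradient under the smooth gate.
print(hard_clip_weight(1.3), round(soft_gate_weight(1.3), 3))
```

Because the gate is smooth and symmetric in log space, gradients shrink monotonically as the ratio drifts from 1 in either direction, rather than being cut off at a threshold.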
Qwen
Qwen Chat offers comprehensive functionality spanning chatbot, image and video understanding, image generation, document processing, web search integration, tool utilization, and artifacts.
https://qwen.ai/blog?id=sapo

Seonglae Cho