Language Model RL uses sequence-level rewards (a single reward for the complete answer), but the actual training is done at the token level. It is essentially a Contextual Bandit that evaluates the generated tokens in a given context, even though the per-token KL-divergence terms make the token-level treatment differ from a pure bandit. The two directions for resolving this mismatch in RL on LLMs are a Reward Model that operates at a more granular level or, more fundamentally, a World Model.
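A minimal sketch of this setup, assuming a REINFORCE-style update in PyTorch; the function name, the `kl_coef` value, and the sampled-token KL estimate are illustrative choices, not something prescribed here:

```python
import torch

def token_level_loss(logprobs_new, logprobs_ref, seq_reward, kl_coef=0.1):
    """Spread one sequence-level reward over token-level updates.

    logprobs_new: (T,) log-probs of the sampled tokens under the learning policy
    logprobs_ref: (T,) log-probs of the same tokens under a frozen reference policy
    seq_reward:   scalar reward for the whole answer (e.g. from a reward model)
    """
    # Per-token KL penalty, estimated on the sampled tokens.
    per_token_kl = logprobs_new - logprobs_ref                  # (T,)
    # Broadcast the single sequence reward to every token: the whole answer
    # is treated as one action in the prompt context (the bandit view).
    per_token_return = seq_reward - kl_coef * per_token_kl      # (T,)
    # REINFORCE-style token-level objective (no baseline, for brevity).
    return -(logprobs_new * per_token_return.detach()).mean()
```

Detaching the return gives the standard policy-gradient form: the gradient flows only through the log-probabilities, while the sequence reward and KL penalty act as per-token weights.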
MoE models can be stabilized to behave like a 'dense model' by freezing expert routing during RL training.
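One way to realize this, sketched below under my own assumptions, is to cache the expert indices chosen during the rollout forward pass and replay them in the training forward pass, so the discrete routing cannot drift between rollout and update; `FrozenRoutingMoELayer` and its internals are illustrative, not any particular library's API:

```python
import torch
import torch.nn.functional as F

class FrozenRoutingMoELayer(torch.nn.Module):
    """Top-k MoE layer that can replay cached expert assignments (illustrative)."""

    def __init__(self, d_model, n_experts, top_k=2):
        super().__init__()
        self.router = torch.nn.Linear(d_model, n_experts)
        self.experts = torch.nn.ModuleList(
            [torch.nn.Linear(d_model, d_model) for _ in range(n_experts)]
        )
        self.top_k = top_k

    def forward(self, x, cached_topk=None):
        logits = self.router(x)                               # (N, E)
        if cached_topk is None:
            # Rollout pass: choose experts with the live router.
            topk = logits.topk(self.top_k, dim=-1).indices    # (N, k)
        else:
            # RL update pass: replay the rollout's expert choices, so the
            # layer behaves like a fixed 'dense' composition of experts.
            topk = cached_topk
        weights = F.softmax(logits.gather(-1, topk), dim=-1)  # (N, k)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx = topk[:, slot]
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out, topk
```

In this sketch only the discrete expert selection is frozen; the gate weights still receive gradients through the softmax, so each token keeps the same active experts across rollout and update, like a fixed dense sub-network.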
The token-level objective is a first-order approximation of the sequence-level objective, and this approximation holds only when both of the following are small (see the expansion after this list):
- Training–Inference Discrepancy (the gap between the probabilities the inference engine samples from and those the training engine recomputes for the same tokens)
- Policy Staleness (the difference between the rollout policy and the learning policy)
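A sketch of why both conditions matter, with notation of my own: write $\pi_{\text{roll}}$ for the policy the inference engine actually sampled from (older parameters plus any numerical mismatch) and $\pi_\theta$ for the learning policy. The sequence-level importance ratio factorizes over tokens, and expanding the product to first order in the per-token deviations gives

$$
\frac{\pi_\theta(y \mid x)}{\pi_{\text{roll}}(y \mid x)}
= \prod_{t=1}^{T} \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\text{roll}}(y_t \mid x, y_{<t})}
= \prod_{t=1}^{T} \bigl(1 + \delta_t\bigr)
\;\approx\; 1 + \sum_{t=1}^{T} \delta_t,
\qquad
\delta_t := \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\text{roll}}(y_t \mid x, y_{<t})} - 1.
$$

The dropped cross terms $\delta_t \delta_s$ are negligible only when every $\delta_t$ is small, and $\delta_t$ collects exactly the two error sources above: the training–inference discrepancy contributes even at identical parameters, and policy staleness contributes because the rollout was sampled from older parameters. When either grows, the token-level objective stops tracking the sequence-level one.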
