Routing Replay

Creator: Seonglae Cho
Created: 2025 Dec 4 0:07
Edited: 2025 Dec 29 16:13
Refs: Language Model RL
Language Model RL uses sequence rewards (a single reward for the complete answer), but the actual training is done at the token level. It is essentially a Contextual Bandit Model that evaluates generated tokens in a given context, even though the KL divergence of each token's distribution makes them differ. The two directions for resolving this problem in RL on LLMs are using a Reward model at a granular level or, more fundamentally, a World Model.
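As a concrete illustration, here is a minimal REINFORCE-style sketch in PyTorch (not any specific framework's implementation) of "sequence reward, token-level training": the single scalar reward is broadcast to every token's log-probability, so each token is effectively treated as a contextual-bandit action.

```python
import torch

def token_level_loss(logp_tokens: torch.Tensor, sequence_reward: float) -> torch.Tensor:
    """REINFORCE-style loss where one sequence-level reward is shared by every token."""
    # Broadcast the single scalar reward to every generated token: each token
    # becomes an independent (context, action) pair, i.e. a contextual bandit.
    advantages = torch.full_like(logp_tokens, sequence_reward)
    return -(advantages * logp_tokens).mean()

# Toy example: 5 generated tokens over a 100-word vocabulary.
logits = torch.randn(5, 100, requires_grad=True)
tokens = torch.randint(0, 100, (5,))
logp = torch.log_softmax(logits, dim=-1)[torch.arange(5), tokens]
loss = token_level_loss(logp, sequence_reward=1.0)
loss.backward()  # every token receives the same learning signal
```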
Routing Replay stabilizes an MoE model so that it behaves like a 'dense model' by freezing expert routing during RL training.
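A minimal sketch of the idea, assuming a toy top-k MoE layer (the names `MoELayer` and `replay_indices` and the per-expert loop are illustrative, not a reference implementation; variants also replay the gate weights themselves): during rollout the layer returns the expert indices it picked, and during the RL update those cached indices are fed back so the router cannot change which experts process each token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Toy top-k MoE layer with optional routing replay."""

    def __init__(self, d_model: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_experts)])
        self.top_k = top_k

    def forward(self, x, replay_indices=None):
        logits = self.router(x)                         # [tokens, n_experts]
        if replay_indices is None:
            # Rollout / normal mode: route with the current router and record the choice.
            gate_logits, indices = logits.topk(self.top_k, dim=-1)
        else:
            # Routing replay: reuse the expert indices recorded at rollout time, so the
            # RL update flows through a fixed, dense-like subnetwork of experts
            # (gate weights still come from the current router in this sketch).
            indices = replay_indices
            gate_logits = logits.gather(-1, indices)
        gates = F.softmax(gate_logits, dim=-1)          # [tokens, top_k]
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, k] == e
                if mask.any():
                    out[mask] += gates[mask, k:k+1] * expert(x[mask])
        return out, indices                             # cache `indices` with the rollout

# Rollout: store the routing decisions alongside the trajectory.
layer = MoELayer(d_model=16)
x = torch.randn(4, 16)
y_rollout, cached_indices = layer(x)
# RL update: replay the cached routing so the expert path matches generation.
y_train, _ = layer(x, replay_indices=cached_indices)
```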
The token-level objective is a first-order approximation of the sequence-level objective, and this approximation holds only when both of the following are small (see the sketch after this list):
  • Training–Inference Discrepancy (the gap between the inference engine's token probabilities and the training forward pass's)
  • Policy Staleness (the difference between the rollout policy and the learning policy)
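One way to monitor both terms is to compare per-token log-probabilities of the sampled tokens from three sources: the training forward pass, the inference engine, and the rollout-time policy. A minimal sketch (the function name and the choice of the k3 KL estimator are illustrative assumptions, not the paper's metric):

```python
import torch

def approximation_diagnostics(logp_train: torch.Tensor,
                              logp_infer: torch.Tensor,
                              logp_rollout: torch.Tensor) -> dict:
    """Per-token diagnostics for the two terms that must stay small.

    logp_train:   log-probs of the sampled tokens from the learner's forward pass
    logp_infer:   log-probs of the same tokens from the inference engine
                  (same weights, different kernels/precision/routing)
    logp_rollout: log-probs under the (possibly stale) policy that generated the rollout
    All tensors have shape [T].
    """
    # Training-inference discrepancy: same parameters, different numerics.
    train_infer_gap = (logp_train - logp_infer).abs().mean().item()

    # Policy staleness: drift of the learner away from the rollout policy,
    # measured with the k3 estimator of KL(rollout || learner).
    log_ratio = logp_train - logp_rollout
    staleness_kl = (log_ratio.exp() - 1.0 - log_ratio).mean().item()

    return {"train_infer_gap": train_infer_gap, "staleness_kl": staleness_kl}
```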
 
 

Recommendations