CRL: Control Reinforcement Learning
The token-level objective is a first-order approximation of the sequence-level objective. The approximation is only valid when both of the following are small:
- Training–Inference Discrepancy
- Policy Staleness (the difference between the rollout policy and the learning policy)
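
The first-order claim above can be checked numerically. This is a minimal sketch, not the note's own derivation. It assumes the sequence-level importance ratio is the product of per-token ratios between the learning policy and the policy that generated the samples, and that small discrepancy and small staleness together keep each per-token ratio close to 1. The function names and the example ratios are hypothetical and chosen only for illustration.

```python
import math


def sequence_ratio(token_ratios):
    """Exact sequence-level importance ratio: the product of per-token ratios."""
    return math.prod(token_ratios)


def first_order_ratio(token_ratios):
    """First-order expansion around ratio = 1: one plus the sum of per-token deviations."""
    return 1.0 + sum(r - 1.0 for r in token_ratios)


def approximation_error(token_ratios):
    """Absolute gap between the exact product and its first-order expansion."""
    return abs(sequence_ratio(token_ratios) - first_order_ratio(token_ratios))


# Hypothetical per-token ratios for an 8-token sequence (assumed values, not measured data).
signs = (1, -1, 1, -1, 1, -1, 1, -1)
small_deviation = [1.0 + 0.01 * s for s in signs]  # both discrepancies small
large_deviation = [1.0 + 0.30 * s for s in signs]  # stale or mismatched policy

if __name__ == "__main__":
    print("error with small deviations:", approximation_error(small_deviation))
    print("error with large deviations:", approximation_error(large_deviation))
```

The gap between the two quantities consists of second-order and higher terms in the per-token deviations, so it shrinks quickly as the ratios approach 1 and grows once either source of mismatch becomes large.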

Seonglae Cho