Per-Token Reward

Creator

Seonglae Cho

Created

2025 Dec 4 0:13

Editor

Seonglae Cho

Edited

2025 Dec 4 0:24

Refs

CRL: Control Reinforcement Learning

openreview.net

https://openreview.net/pdf?id=jiPrwmMb2e

The token-level objective is a first-order approximation of the sequence-level objective. The approximation holds only when both of the following are small:

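Training–Inference Discrepancy (the mismatch between the inference engine that generates rollouts and the training engine that computes the update)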

Policy Staleness (the difference between the rollout policy and the learning policy)
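
To see why both gaps need to be small, here is a minimal numeric sketch. It is a simplified illustration, not the derivation from the linked paper. It assumes the sequence-level importance ratio is the product of per-token ratios r_t = pi(y_t) / mu(y_t), where mu is the rollout policy and pi is the learning policy, and compares that product with its first-order expansion 1 + sum_t (r_t - 1). The function names and the toy log-probabilities are hypothetical.

```python
import math


def sequence_ratio(logp_learn, logp_rollout):
    """Exact sequence-level importance ratio: product of per-token ratios."""
    return math.exp(sum(a - b for a, b in zip(logp_learn, logp_rollout)))


def first_order_ratio(logp_learn, logp_rollout):
    """First-order expansion of the product: 1 + sum of (r_t - 1)."""
    return 1.0 + sum(
        math.exp(a - b) - 1.0 for a, b in zip(logp_learn, logp_rollout)
    )


def approximation_gap(shift, num_tokens=50):
    """Absolute gap between the two when every token log-prob differs by `shift`.

    `shift` stands in for the combined effect of training-inference
    discrepancy and policy staleness (a toy assumption).
    """
    logp_rollout = [-1.0] * num_tokens
    logp_learn = [lp + shift for lp in logp_rollout]
    exact = sequence_ratio(logp_learn, logp_rollout)
    approx = first_order_ratio(logp_learn, logp_rollout)
    return abs(exact - approx)


if __name__ == "__main__":
    for shift in (0.0001, 0.001, 0.01):
        print(f"shift={shift}: gap={approximation_gap(shift):.6f}")
```

The gap comes from the dropped second-order and higher terms, so it grows quickly as the per-token deviation grows and as sequences get longer. That is the sense in which the token-level view is only trustworthy when the rollout and learning policies stay close.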

arxiv.org

https://arxiv.org/pdf/2512.01374
