Humans do not consider all paths in parallel, nor do they reward every step of the process just because the final answer turned out correct. Instead, they should penalize intermediate steps that were off track and give extra credit to the steps that actually helped. In other words, a reflection process that synthesizes multiple trials is also necessary.
- ORM (Outcome Reward Model): A method that gives rewards based only on the final output.
- PRM (Process Reward Model): A method that evaluates and rewards each intermediate step (the two are contrasted in the sketch below).
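A minimal sketch of how the two assign credit over a reasoning trace (illustrative only; the function names and the toy `step_scorer` are assumptions, not any particular framework's API):

```python
from typing import Callable, List


def orm_rewards(steps: List[str], final_answer_correct: bool) -> List[float]:
    """Outcome Reward Model: one scalar based only on the final output.

    Every intermediate step implicitly shares that credit, so a flawed
    derivation that happens to land on the right answer is fully rewarded.
    """
    return [0.0] * (len(steps) - 1) + [1.0 if final_answer_correct else 0.0]


def prm_rewards(steps: List[str], step_scorer: Callable[[str], float]) -> List[float]:
    """Process Reward Model: every step is scored on its own merit.

    `step_scorer` stands in for a learned step verifier (e.g. scores in
    [0, 1]); it is a placeholder, not a real API.
    """
    return [step_scorer(step) for step in steps]


trace = ["x = 3", "then 2x = 5 (arithmetic slip)", "so the answer is 6"]
print(orm_rewards(trace, final_answer_correct=True))
# [0.0, 0.0, 1.0] -- the slip is never penalized
print(prm_rewards(trace, step_scorer=lambda s: 0.0 if "slip" in s else 1.0))
# [1.0, 0.0, 1.0] -- the flawed intermediate step gets no credit
```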
Per-Token Rewards
Per-Chunk Rewards
CRL: Control Reinforcement Learning
The token-level objective is a first-order approximation of the sequence-level objective, and the approximation holds only when both of the following are small (see the sketch after this list):
- Training–Inference Discrepancy (the mismatch between the inference engine's sampling distribution and the distribution computed by the training framework)
- Policy Staleness (the difference between the rollout policy and the learning policy)
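A sketch of where the approximation comes from, writing $\mu$ for the rollout/behavior distribution and $r_t$ for the per-token importance ratio (notation assumed here for illustration, not taken from a specific paper):

```latex
% Sequence-level importance ratio factorizes into per-token ratios r_t
\[
\frac{\pi_\theta(y \mid x)}{\mu(y \mid x)} = \prod_{t=1}^{T} r_t,
\qquad
r_t = \frac{\pi_\theta(y_t \mid x, y_{<t})}{\mu(y_t \mid x, y_{<t})}
\]
% Writing r_t = 1 + \epsilon_t and expanding to first order:
\[
\prod_{t=1}^{T} (1 + \epsilon_t) \;\approx\; 1 + \sum_{t=1}^{T} (r_t - 1)
\]
```

The expansion is valid only when every $|r_t - 1|$ is small, which is exactly the requirement that the inference engine's samples match the training-time computation (training–inference discrepancy) and that the rollout policy has not drifted far from the learning policy (policy staleness).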
Dataset
In reasoning training, data distribution may matter more than correctness: SFT on incorrect CoT generated by a stronger model can outperform SFT on perfectly correct human-written CoT. Two reasons:
- Distribution proximity: synthetic CoT is closer to the student model's output distribution, making it easier to learn from.
- Partial validity: even incorrect CoT still contains many correct intermediate reasoning steps.
- Paraphrasing human-written CoT with an LLM improves performance → supports the distribution effect
- Gradually increasing the error rate leads to only a gradual performance decline → demonstrates error tolerance
In other words, final-answer accuracy is not a reliable indicator of CoT quality. In reasoning SFT data curation, "distribution alignment" is a more critical variable than correctness.
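One way to act on this, offered purely as a hedged sketch: rank candidate CoT samples by how likely the student model already finds them, using mean token log-likelihood as a rough proxy for distribution proximity. The model name and the candidate strings below are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder student model; substitute your own

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


@torch.no_grad()
def avg_token_logprob(text: str) -> float:
    """Mean per-token log-likelihood of `text` under the student model.

    Higher values suggest the CoT is closer to the student's own output
    distribution and therefore (by the argument above) easier to learn from.
    """
    ids = tokenizer(text, return_tensors="pt").input_ids
    # Causal LM loss with labels=ids is the mean negative log-likelihood
    loss = model(ids, labels=ids).loss
    return -loss.item()


candidates = [
    "Human-written CoT: ...",      # perfectly correct but possibly off-distribution
    "Teacher-generated CoT: ...",  # possibly flawed but closer to the student
]
ranked = sorted(candidates, key=avg_token_logprob, reverse=True)
print(ranked[0])  # the candidate the student already finds most "natural"
```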

Seonglae Cho