Humans do not consider all paths in parallel, nor do they reward every step of the process just because the final answer turned out correct. Instead, they should penalize intermediate steps that were off track and give extra credit to the steps that actually helped. In other words, a reflection process that synthesizes multiple trials is also necessary.
- ORM (Outcome Reward Model): A method that gives rewards based only on the final output.
- PRM (Process Reward Model): A method that evaluates and rewards each intermediate step (the two are contrasted in the sketch below).
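A minimal sketch of how the two assign credit over a reasoning trace (illustrative only; the function names and the toy `step_scorer` are assumptions, not any particular framework's API):

```python
from typing import Callable, List


def orm_rewards(steps: List[str], final_answer_correct: bool) -> List[float]:
    """Outcome Reward Model: one scalar based only on the final output.

    Every intermediate step implicitly shares that credit, so a flawed
    derivation that happens to land on the right answer is fully rewarded.
    """
    return [0.0] * (len(steps) - 1) + [1.0 if final_answer_correct else 0.0]


def prm_rewards(steps: List[str], step_scorer: Callable[[str], float]) -> List[float]:
    """Process Reward Model: every step is scored on its own merit.

    `step_scorer` stands in for a learned step verifier (e.g. scores in
    [0, 1]); it is a placeholder, not a real API.
    """
    return [step_scorer(step) for step in steps]


trace = ["x = 3", "then 2x = 5 (arithmetic slip)", "so the answer is 6"]
print(orm_rewards(trace, final_answer_correct=True))
# [0.0, 0.0, 1.0] -- the slip is never penalized
print(prm_rewards(trace, step_scorer=lambda s: 0.0 if "slip" in s else 1.0))
# [1.0, 0.0, 1.0] -- the flawed intermediate step gets no credit
```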
Per-Token Rewards
Per-Chunk Rewards
CRL: Control Reinforcement Learning
The token-level objective is a first-order approximation of the sequence-level objective, and the approximation holds only when both of the following are small (see the sketch after this list):
- Training–Inference Discrepancy (the mismatch between the inference engine's sampling distribution and the distribution computed by the training framework)
- Policy Staleness (the difference between the rollout policy and the learning policy)
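A sketch of where the approximation comes from, writing $\mu$ for the rollout/behavior distribution and $r_t$ for the per-token importance ratio (notation assumed here for illustration, not taken from a specific paper):

```latex
% Sequence-level importance ratio factorizes into per-token ratios r_t
\[
\frac{\pi_\theta(y \mid x)}{\mu(y \mid x)} = \prod_{t=1}^{T} r_t,
\qquad
r_t = \frac{\pi_\theta(y_t \mid x, y_{<t})}{\mu(y_t \mid x, y_{<t})}
\]
% Writing r_t = 1 + \epsilon_t and expanding to first order:
\[
\prod_{t=1}^{T} (1 + \epsilon_t) \;\approx\; 1 + \sum_{t=1}^{T} (r_t - 1)
\]
```

The expansion is valid only when every $|r_t - 1|$ is small, which is exactly the requirement that the inference engine's samples match the training-time computation (training–inference discrepancy) and that the rollout policy has not drifted far from the learning policy (policy staleness).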
Dataset
In reasoning training, data distribution may matter more than correctness: SFT on incorrect CoT generated by a stronger model can outperform SFT on perfectly correct human-written CoT. Two reasons:
- Distribution proximity: synthetic CoT is closer to the student model's output distribution, making it easier to learn from.
- Partial validity: even incorrect CoT still contains many correct intermediate reasoning steps.
- Paraphrasing human-written CoT with an LLM improves performance → supports the distribution effect
- Gradually increasing the error rate leads to only a gradual performance decline → demonstrates error tolerance
In other words, final-answer accuracy is not a reliable indicator of CoT quality. In reasoning SFT data curation, "distribution alignment" is a more critical variable than correctness.
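One way to act on this, offered purely as a hedged sketch: rank candidate CoT samples by how likely the student model already finds them, using mean token log-likelihood as a rough proxy for distribution proximity. The model name and the candidate strings below are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder student model; substitute your own

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


@torch.no_grad()
def avg_token_logprob(text: str) -> float:
    """Mean per-token log-likelihood of `text` under the student model.

    Higher values suggest the CoT is closer to the student's own output
    distribution and therefore (by the argument above) easier to learn from.
    """
    ids = tokenizer(text, return_tensors="pt").input_ids
    # Causal LM loss with labels=ids is the mean negative log-likelihood
    loss = model(ids, labels=ids).loss
    return -loss.item()


candidates = [
    "Human-written CoT: ...",      # perfectly correct but possibly off-distribution
    "Teacher-generated CoT: ...",  # possibly flawed but closer to the student
]
ranked = sorted(candidates, key=avg_token_logprob, reverse=True)
print(ranked[0])  # the candidate the student already finds most "natural"
```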

Seonglae Cho