All RLHF-style language model RL methods include this to prevent AI Reward Hacking.
Language Model RL uses sequence-level rewards (a reward for the complete answer), but the actual training is done at the token level. It is essentially a Contextual Bandit Model evaluating generated tokens in a given context, even though the per-token KL divergence to the reference distribution makes it slightly different. The two directions for resolving this problem in RL on LLMs are using a Reward model at a more granular level or, more fundamentally, a World Model.
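A minimal sketch of this mismatch, assuming a REINFORCE-style update (function name, shapes, and the sampled-token KL estimate are my own illustration, not any specific library's API): one scalar sequence reward is broadcast as the same credit to every generated token, shaped by a per-token KL penalty against a frozen reference policy.

```python
# Sketch: sequence-level reward applied at the token level with a per-token KL penalty.
import torch
import torch.nn.functional as F

def rlhf_token_loss(logits, ref_logits, response_ids, sequence_reward, kl_coef=0.1):
    """logits, ref_logits: (T, vocab) policy / frozen reference logits for the response
    response_ids:       (T,) sampled token ids
    sequence_reward:    scalar reward for the whole answer (e.g. from a reward model)
    """
    logp = F.log_softmax(logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)

    token_logp = logp.gather(-1, response_ids.unsqueeze(-1)).squeeze(-1)          # (T,)
    ref_token_logp = ref_logp.gather(-1, response_ids.unsqueeze(-1)).squeeze(-1)  # (T,)

    # Per-token KL estimate on the sampled tokens: keeps each token close to the reference.
    kl = token_logp - ref_token_logp
    token_reward = sequence_reward - kl_coef * kl  # same scalar credit for every token

    # Contextual-bandit / REINFORCE update: every token shares the sequence-level return.
    return -(token_logp * token_reward.detach()).mean()

# Toy usage with random tensors standing in for model outputs.
T, V = 8, 100
loss = rlhf_token_loss(torch.randn(T, V, requires_grad=True), torch.randn(T, V),
                       torch.randint(V, (T,)), sequence_reward=1.0)
loss.backward()
```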
Humans do not consider all paths in parallel, nor do they give bonus points to every step of the process just because the final answer was correct. Instead, they regret intermediate steps that went astray and give extra credit to those that were helpful. In other words, a reflection process that synthesizes multiple trials is also necessary.
- ORM (Outcome Reward Model): A method that gives rewards based only on the final output.
- PRM (Process Reward Model): A method that evaluates and rewards each step.
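A toy sketch of the two reward schemes above (the step scores are made up for illustration): ORM credits only the final output, while PRM penalizes a flawed intermediate step even when the final answer happens to be right.

```python
# Hypothetical 4-step chain of thought whose final answer is correct despite a flawed step.
steps = ["define variables", "set up equation", "algebra slip", "lucky correct answer"]

# ORM: one reward for the final output only; intermediate steps get no individual signal.
orm_rewards = [0.0, 0.0, 0.0, 1.0]

# PRM: each step is scored on its own merit, so the flawed step is penalized.
prm_rewards = [0.9, 0.8, 0.1, 0.9]

for step, orm_r, prm_r in zip(steps, orm_rewards, prm_rewards):
    print(f"{step:>22}  ORM={orm_r:.1f}  PRM={prm_r:.1f}")
```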
Language Model RL Methods
Language Model Reward Benchmarks
Reinforcement Learning Transformers
Language Model RL Frameworks
Pre-training, mid-training, and RL post-training each contribute to "reasoning capability" in different ways, and these contributions can be causally decomposed in a fully controlled (synthetic) setting. The approach uses DAGs to create reasoning structures and templates (e.g., animal-zoo, teacher-school) to vary only the surface context. Evaluation is process-verified: the model's output CoT is parsed and the predicted graph is compared against the ground-truth graph.
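A minimal sketch of this setup (my own simplification; edge format and template wording are assumptions): the problem is a small DAG, templates only change the surface nouns, and the score compares the graph recovered from the CoT with the ground-truth graph instead of checking the final answer alone.

```python
# Sketch: DAG-structured synthetic reasoning with process-verified evaluation.
import re

ground_truth_edges = {("zoo", "keeper"), ("keeper", "lion"), ("lion", "diet")}

templates = {
    "animal-zoo":     "The {a} manages the {b}.",
    "teacher-school": "The {a} supervises the {b}.",  # same structure, different surface context
}
print(templates["teacher-school"].format(a="teacher", b="student"))

def parse_cot_edges(cot: str) -> set[tuple[str, str]]:
    """Recover claimed dependencies of the form 'X -> Y' from a chain of thought."""
    return {(a.strip(), b.strip()) for a, b in re.findall(r"(\w+)\s*->\s*(\w+)", cot)}

def process_verified_score(cot: str, truth: set[tuple[str, str]]) -> float:
    """Fraction of ground-truth edges the model's reasoning actually used."""
    predicted = parse_cot_edges(cot)
    return len(predicted & truth) / len(truth)

model_cot = "zoo -> keeper, keeper -> lion, so lion -> diet"
print(process_verified_score(model_cot, ground_truth_edges))  # 1.0: structure matches
```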
Generalization types
- Extrapolative (Depth): Can the model solve problems when the number of operations (reasoning steps) goes deeper than the training range?
- Contextual (Breadth): Does the same reasoning structure transfer when surface context changes?
The conditions under which RL produces 'true capability gain' are narrow. For in-distribution (ID) problems already well covered by pre-training, RL increases pass@1 but barely improves pass@128 → this is closer to "sharpening" existing capabilities. Conversely, when pre-training leaves headroom and the RL data targets the edge of competence (where pass@1 fails but pass@k shows some success), substantial expansion occurs, with pass@128 improvements even on OOD. RL on problems that are too easy (ID) or too hard (complete OOD-hard) doesn't train well. In other words, easy problems get sharpened, while appropriately difficult (edge-of-competence) problems yield genuine capability gains.
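For reference, the pass@1 vs pass@128 comparison above uses the standard unbiased pass@k estimator (HumanEval-style); the sample counts below are illustrative only.

```python
# pass@k = 1 - C(n-c, k) / C(n, k): probability that at least one of k samples
# (drawn from the n generated ones, of which c are correct) solves the problem.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=256, c=40, k=1))    # ~0.156
print(pass_at_k(n=256, c=40, k=128))  # ~1.0 — coverage can saturate even when pass@1 is low
```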
Contextual generalization requires a 'minimal seed'. For new long-tail context (B), if pre-training exposure is 0%~0.1%, RL also struggles to transfer. However, if context B is included in pre-training even sparsely (e.g., ≥1%) at the atomic primitive level, RL amplifies that seed to create strong transfer (large pass@128 gains). RL struggles to create something from nothing but excels at scaling up small foundations (seeds).
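A sketch of what the "minimal seed" condition means operationally (corpus names and counts are hypothetical): context B's atomic primitives are mixed into pre-training at a small but non-zero fraction.

```python
# Sketch: pre-training mixture with a tunable seed fraction for context B primitives.
import random

def build_pretraining_mix(corpus_a, corpus_b_primitives, seed_fraction, total):
    """Sample a stream where `seed_fraction` of documents come from context B's
    primitives and the rest from the well-covered context A."""
    n_b = int(seed_fraction * total)
    docs = random.choices(corpus_b_primitives, k=n_b) + random.choices(corpus_a, k=total - n_b)
    random.shuffle(docs)
    return docs

corpus_a = ["animal-zoo doc"]
corpus_b = ["teacher-school primitive"]
for frac in (0.0, 0.001, 0.01):  # 0%, 0.1%, 1% seed exposure
    mix = build_pretraining_mix(corpus_a, corpus_b, frac, total=10_000)
    print(f"{frac:.1%} seed -> {mix.count('teacher-school primitive')} B docs in 10k")
```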
Mid-training is the 'bridge' that significantly affects RL efficiency. When varying the ratio (β) between mid-training and RL with the same compute budget, characteristics diverge: OOD-edge (moderately difficult domain) performance (especially pass@1) is better when mid-training proportion is high and RL is light. OOD-hard (much harder domain) generalization improves when RL proportion is increased. Mid-training lays down priors/representations; RL expands exploration and composition on top of that.
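A small sketch of the compute-matched comparison (token budget is an assumed, illustrative number): the same total budget is split between mid-training and RL by β, and each setting is then evaluated on OOD-edge vs OOD-hard.

```python
# Sketch: fixed compute budget split between mid-training and RL by ratio beta.
TOTAL_TOKENS = 1_000_000_000  # illustrative fixed budget

def budget_split(beta: float) -> dict:
    """beta = fraction of the budget spent on mid-training; the rest goes to RL."""
    return {"mid_training_tokens": int(beta * TOTAL_TOKENS),
            "rl_tokens": int((1 - beta) * TOTAL_TOKENS)}

for beta in (0.9, 0.5, 0.1):
    print(beta, budget_split(beta))
```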
PRMs reduce AI Reward Hacking and improve performance. Using only outcome rewards (correctness) easily leads to "shortcuts/dishonest reasoning that just gets the answer right." Mixing in dense rewards based on process verification (α mix), or only granting the outcome reward when the process is correct, reduces structural errors (dependency mismatches, etc.) and improves pass@1 and some pass@128 on OOD.
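A sketch of the two reward-shaping variants just described (the α value, step scores, and function names are illustrative assumptions): the α mix blends dense process scores with the outcome reward, while the gated variant withholds the outcome reward unless the process checks out.

```python
# Sketch: alpha-mixed vs process-gated reward for a verified chain of thought.

def mixed_reward(outcome_correct: bool, step_scores: list[float], alpha: float = 0.5) -> float:
    """alpha-mix: dense process reward blended with the sparse outcome reward."""
    process = sum(step_scores) / len(step_scores) if step_scores else 0.0
    outcome = 1.0 if outcome_correct else 0.0
    return alpha * process + (1 - alpha) * outcome

def gated_reward(outcome_correct: bool, process_verified: bool) -> float:
    """Outcome reward is granted only when the process itself is verified,
    removing the incentive for answer-only shortcuts."""
    return 1.0 if (outcome_correct and process_verified) else 0.0

# A lucky guess with a broken derivation scores low under both schemes.
print(mixed_reward(outcome_correct=True, step_scores=[0.2, 0.1, 0.3]))  # 0.6
print(gated_reward(outcome_correct=True, process_verified=False))       # 0.0
```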
Era of Experience
However, it has low sample efficiency at larger sample sizes and only finds better reasoning paths within its existing capacity, which makes its total problem-solving coverage smaller.
SFT Memorizes, RL Generalizes (Model Generalization)
While the provocative title is not exactly correct, it provides insight even for Multimodality.
Agent RL vulnerability
Search LLMs trained with agentic RL may appear safe, but can be easily jailbroken by manipulating the timing of the search step. The RL objective itself fails to suppress harmful queries, making "search first" behavior a critical vulnerability.
Information density (bits/sample) is very low in early training. Sample efficiency: supervised learning provides the correct answer for every token, so every sample yields a lot of information, but an RL sample only carries high information when the probability of a correct answer is around 50%. In the early stages of RL, there are almost no correct samples, leading to extreme gradient variance.
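One way to see the 50% claim is the entropy of a binary correct/incorrect signal: a Bernoulli outcome carries at most H(p) = −p·log₂(p) − (1−p)·log₂(1−p) bits per sample, which peaks at p = 0.5 and collapses toward zero when almost every sample fails, as in early RL.

```python
# Bits of information per binary correct/incorrect sample as a function of success rate.
from math import log2

def bits_per_sample(p: float) -> float:
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

for p in (0.001, 0.01, 0.1, 0.5, 0.9):
    print(f"P(correct)={p:<5} -> {bits_per_sample(p):.3f} bits/sample")
```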

Seonglae Cho
