All RLHF-like language model RL methods have this to prevent AI Reward Hacking.
Language Model RL uses sequence-level rewards (a single reward for the complete answer), but the actual training update is applied at the token level. It is essentially a Contextual Bandit evaluating generated tokens in a given context, with only the per-token KL divergence term differentiating each token's update. The two directions for resolving this credit-assignment problem in RL on LLMs are a Reward Model applied at a more granular level or, more fundamentally, a World Model.
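A minimal sketch of this credit assignment, assuming a PPO/RLHF-style setup (the function and tensor names below are illustrative, not any specific library's API): the sequence-level reward is placed on the final token, while every token receives its own KL penalty against a frozen reference policy.

```python
# Minimal sketch, assuming PPO/RLHF-style reward shaping; names are illustrative.
import torch

def token_level_rewards(
    policy_logprobs: torch.Tensor,  # (T,) log-probs of generated tokens under the trained policy
    ref_logprobs: torch.Tensor,     # (T,) log-probs of the same tokens under the frozen reference model
    sequence_reward: float,         # scalar reward for the complete answer (e.g. from a reward model)
    kl_coef: float = 0.1,           # KL penalty strength
) -> torch.Tensor:
    """Broadcast a sequence-level reward into per-token rewards."""
    # Per-token KL estimate (log pi - log pi_ref); penalizing it keeps the policy
    # near the reference model, the usual guard against reward hacking.
    kl_penalty = kl_coef * (policy_logprobs - ref_logprobs)
    rewards = -kl_penalty
    rewards[-1] = rewards[-1] + sequence_reward  # outcome reward lands only on the last token
    return rewards

# Toy example: 5 generated tokens, one outcome reward of +1.0
T = 5
rewards = token_level_rewards(
    policy_logprobs=-torch.rand(T),  # placeholder log-probs
    ref_logprobs=-torch.rand(T),
    sequence_reward=1.0,
)
print(rewards)  # every token pays a KL cost; only the last also receives the answer reward
```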
Humans do not consider all paths in parallel, nor do they credit every intermediate step just because the final answer was correct. Strange intermediate steps should be regretted (penalized) and genuinely helpful ones rewarded. In other words, a reflection process that synthesizes multiple trials is also necessary.
- ORM (Outcome Reward Model): A method that gives rewards based only on the final output.
- PRM (Process Reward Model): A method that evaluates and rewards each step.
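The difference in credit assignment can be sketched as a toy contrast, with `score_final_answer` and `score_step` standing in for learned reward models (both names are assumptions, not a real API):

```python
# Toy contrast between ORM and PRM credit assignment (illustrative only).
from typing import Callable, List

def orm_rewards(steps: List[str], score_final_answer: Callable[[str], float]) -> List[float]:
    """Outcome Reward Model: one reward, based only on the final output."""
    # Intermediate steps get nothing; only the final step carries the outcome reward.
    return [0.0] * (len(steps) - 1) + [score_final_answer(steps[-1])]

def prm_rewards(steps: List[str], score_step: Callable[[str], float]) -> List[float]:
    """Process Reward Model: each step is evaluated and rewarded individually."""
    return [score_step(step) for step in steps]

# Usage with toy scorers: ORM only checks the answer, PRM can penalize a strange middle step.
steps = ["Let x = 2", "Then x + x = 5", "So the answer is 4"]
print(orm_rewards(steps, score_final_answer=lambda s: 1.0 if "4" in s else -1.0))  # [0.0, 0.0, 1.0]
print(prm_rewards(steps, score_step=lambda s: -1.0 if "= 5" in s else 1.0))        # [1.0, -1.0, 1.0]
```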
Language Model RL Methods
Language Model Reward Benchmarks
Reinforcement Learning Transformers
Language Model RL Frameworks
Era of Experience
However, it has low Sample efficiency at larger sample budgets and only finds better reasoning paths within the model's existing capacity, which makes its total problem-solving coverage smaller.
While the provocative title is not exactly correct, it provides insight even for Multimodality.
The appendix is awesome for Language Model RL.
Agent RL vulnerability
Search LLMs trained with agentic RL may appear safe, but can be easily jailbroken by manipulating the timing of the search step. The RL objective itself fails to suppress harmful queries, making "search first" behavior a critical vulnerability.
Information density (bits/sample) is very low in early training, which is a Sample efficiency problem. Supervised learning provides the correct answer for every token, so each sample always yields a lot of information, but RL yields high information only when the probability of a correct answer is around 50%. In the early stages of RL there are almost no correct samples, leading to extreme gradient variance.
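As a rough intuition sketch (an added illustration, not from the source): treating each rollout as a pass/fail outcome, the information it carries is the Bernoulli entropy H(p), which peaks at 1 bit when p = 0.5 and collapses toward 0 when correct answers are almost never (or almost always) produced.

```python
# Sketch: bits of information carried by a single pass/fail RL outcome.
import math

def bits_per_sample(p_correct: float) -> float:
    """Shannon entropy (bits) of a Bernoulli(p_correct) outcome."""
    if p_correct in (0.0, 1.0):
        return 0.0
    p, q = p_correct, 1.0 - p_correct
    return -(p * math.log2(p) + q * math.log2(q))

for p in (0.01, 0.1, 0.5, 0.9, 0.99):
    print(f"p(correct)={p:.2f} -> {bits_per_sample(p):.3f} bits/sample")
# Early RL sits near p=0.01 (~0.08 bits/sample); supervised learning labels every token instead.
```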

