Language Model RL

Creator: Seonglae Cho
Created: 2023 Sep 9 17:08
Edited: 2025 Dec 29 19:38
RLHF-like language model RL methods include safeguards to prevent AI Reward Hacking.
Language Model RL uses sequence-level rewards (rewards for the complete answer), but the actual training is done at the token level. It is essentially a Contextual Bandit Model evaluating generated tokens in a given context, even though the per-token KL-divergence penalty makes it differ from a pure bandit. The two directions for resolving this mismatch in RL on LLMs are using a Reward model at a more granular level or, more fundamentally, a World Model.
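A minimal sketch of this token-level shaping, assuming a PPO-style RLHF setup in which the sequence reward lands on the final token and every token pays a KL penalty against a frozen reference model; the function name, tensor shapes, and kl_coef value are illustrative, not taken from any specific library.

```python
import torch

def shape_token_rewards(seq_reward: torch.Tensor,
                        policy_logprobs: torch.Tensor,
                        ref_logprobs: torch.Tensor,
                        kl_coef: float = 0.1) -> torch.Tensor:
    """Turn a single sequence-level reward into per-token rewards.

    seq_reward:      (batch,)        scalar reward for each complete answer
    policy_logprobs: (batch, tokens) log-probs of sampled tokens under the policy
    ref_logprobs:    (batch, tokens) log-probs of the same tokens under the reference model
    """
    # Per-token KL penalty keeps the policy close to the reference model,
    # the usual guard against reward hacking in RLHF-style training.
    kl_penalty = kl_coef * (policy_logprobs - ref_logprobs)

    rewards = -kl_penalty            # every token pays its KL cost
    rewards[:, -1] += seq_reward     # the sequence reward is assigned to the last token
    return rewards

# toy usage
batch, tokens = 2, 5
rewards = shape_token_rewards(
    seq_reward=torch.tensor([1.0, 0.0]),
    policy_logprobs=torch.randn(batch, tokens),
    ref_logprobs=torch.randn(batch, tokens),
)
print(rewards.shape)  # torch.Size([2, 5])
```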
Humans do not consider all paths in parallel, nor do they give bonus points to every step of the process just because the final answer was correct. Instead, they regret intermediate steps that were flawed and credit the ones that actually helped. In other words, a reflection process that synthesizes multiple trials is also necessary.
  • ORM (Outcome Reward Model): A method that gives a reward based only on the final output.
  • PRM (Process Reward Model): A method that evaluates and rewards each intermediate step (the two are contrasted in the sketch below).
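A hedged sketch contrasting the two, assuming the answer has already been split into reasoning steps and that a step-level scorer is available for the PRM case; both functions and the toy scorer are hypothetical.

```python
from typing import Callable, List

def orm_rewards(steps: List[str], outcome_correct: bool) -> List[float]:
    """ORM: only the final output is judged; intermediate steps get no signal."""
    rewards = [0.0] * len(steps)
    rewards[-1] = 1.0 if outcome_correct else 0.0
    return rewards

def prm_rewards(steps: List[str], step_scorer: Callable[[str], float]) -> List[float]:
    """PRM: every intermediate step is scored, so a flawed step can be penalized
    even when the final answer happens to be correct."""
    return [step_scorer(step) for step in steps]

steps = ["Let x = 3", "Then 2x = 5", "So the answer is 6"]   # second step is flawed
print(orm_rewards(steps, outcome_correct=True))              # [0.0, 0.0, 1.0]
print(prm_rewards(steps, step_scorer=lambda s: 0.0 if "2x = 5" in s else 1.0))
# [1.0, 0.0, 1.0]
```

The PRM signal is what allows training to penalize the flawed middle step even though the outcome-only reward would have credited the whole trajectory.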
Language Model RL Methods

Language Model Reward Benchmarks

Reinforcement Learning Transformers

Language Model RL Frameworks

Era of Experience

However, RL has low Sample efficiency as the number of samples grows, and it only finds better reasoning paths within the model's existing capacity, which limits its total problem-solving coverage.
SFT Memorizes, RL Generalizes (AI Memory, Model Generalization, OOD)
While the provocative title is not exactly correct, it provides insight even for Multimodality.

Agent RL vulnerability

Search LLMs trained with agentic RL may appear safe, but can be easily jailbroken by manipulating the timing of the search step. The RL objective itself fails to suppress harmful queries, making "search first" behavior a critical vulnerability.
Sample efficiency
Information density (bits/sample) is very low in early training. Supervised learning provides the correct answer for every token, so every sample carries a lot of information, whereas an RL sample is most informative when the probability of a correct answer is around 50%. In the early stages of RL there are almost no correct samples, leading to extreme gradient variance.
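A small numerical illustration of this point, treating the information carried by one rollout as the binary entropy of the model's success probability; this is a didactic approximation, not a description of any particular training recipe.

```python
import math

def bits_per_sample(p_correct: float) -> float:
    """Binary entropy H(p): the expected information (in bits) revealed by
    observing whether a single rollout is correct."""
    if p_correct in (0.0, 1.0):
        return 0.0  # the outcome is already certain, so nothing is learned
    p, q = p_correct, 1.0 - p_correct
    return -(p * math.log2(p) + q * math.log2(q))

for p in (0.01, 0.1, 0.5, 0.9, 0.99):
    print(f"p(correct)={p:.2f} -> {bits_per_sample(p):.3f} bits")
# Information peaks at 1 bit when p = 0.5 and collapses toward 0 early in training,
# when p(correct) is near 0, which is why early RL gradients are noisy and sample-inefficient.
```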
 
