Verifiable Reward

Creator: Seonglae Cho
Created: 2025 Mar 21 11:54
Edited: 2025 Dec 29 16:20

RLVR (Reinforcement Learning with Verifiable Rewards)

Programmatic Reward, Verifiable Feedback

Externally provided feedback that can be objectively validated against specific criteria.
Humans do not consider all paths in parallel, nor do they give bonus points to every step of a process just because the final answer was correct. Credit assignment should instead penalize intermediate steps that were strange and give extra points to those that were helpful. In other words, a reflection process that synthesizes multiple trials is also necessary.
  • ORM (Outcome Reward Model): A method that gives rewards based only on the final output.
  • PRM (Process Reward Model): A method that evaluates and rewards each step.
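As a minimal sketch of the difference between the two, the toy example below scores a short arithmetic trace both ways. The step format (`a op b = c`), the helper names, and the choice to place the outcome reward on the last step are illustrative assumptions, not part of any specific ORM or PRM implementation; in practice both are usually learned models rather than hand-written rules.

```python
import operator
import re
from typing import Callable, List

# Hypothetical step format for this sketch: "a op b = c" with integer operands.
STEP_PATTERN = re.compile(r"^\s*(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)\s*$")
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def check_step(step: str) -> float:
    """Toy step verifier: 1.0 if the arithmetic in the step holds, else 0.0."""
    match = STEP_PATTERN.match(step)
    if match is None:
        return 0.0  # unparsable steps get no credit in this sketch
    a, op, b, c = match.groups()
    return 1.0 if OPS[op](int(a), int(b)) == int(c) else 0.0


def outcome_rewards(steps: List[str], final_answer: str, reference: str) -> List[float]:
    """ORM-style: a single reward derived from the final output only.

    Assumption: the reward is attached to the last step; all others get 0.
    """
    rewards = [0.0] * len(steps)
    if steps and final_answer.strip() == reference.strip():
        rewards[-1] = 1.0
    return rewards


def process_rewards(steps: List[str], step_scorer: Callable[[str], float]) -> List[float]:
    """PRM-style: every step is scored on its own."""
    return [step_scorer(step) for step in steps]


# The final answer is right, but the middle step is wrong (5 * 4 is not 25).
trace = ["2 + 3 = 5", "5 * 4 = 25", "25 - 5 = 20"]
orm = outcome_rewards(trace, final_answer="20", reference="20")  # [0.0, 0.0, 1.0]
prm = process_rewards(trace, check_step)                         # [1.0, 0.0, 1.0]
```

The outcome reward is positive because the answer matches, while the process reward singles out the faulty middle step, which is the kind of per-step regret described above.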

For Reasoning Data with GRPO

Rule-based Verifiable Reward. The LLM itself is the policy network.
  • Accuracy rewards: Checking whether the model’s final answer is correct
  • Format rewards: Incentivizing a structured chain-of-thought (enclosing the CoT tokens in <think> and </think>)
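The two rule-based rewards above, together with GRPO's group-relative normalization, can be sketched roughly as follows. This is a simplified illustration under stated assumptions: the regular expression, the exact-match answer check, the `format_weight` value, and the use of the population standard deviation are choices made for this example, not details taken from any particular implementation.

```python
import re
from statistics import mean, pstdev
from typing import List, Optional

# Assumed layout for this sketch: "<think>reasoning</think> final answer".
THINK_PATTERN = re.compile(r"^<think>(.+?)</think>\s*(.+)$", re.DOTALL)


def format_reward(completion: str) -> float:
    """1.0 if the completion wraps its reasoning in <think>...</think>."""
    return 1.0 if THINK_PATTERN.match(completion.strip()) else 0.0


def extract_answer(completion: str) -> Optional[str]:
    """Return the text after </think>, or None if the format is not followed."""
    match = THINK_PATTERN.match(completion.strip())
    return match.group(2).strip() if match else None


def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the extracted final answer exactly matches the reference."""
    answer = extract_answer(completion)
    return 1.0 if answer is not None and answer == reference.strip() else 0.0


def total_reward(completion: str, reference: str, format_weight: float = 0.5) -> float:
    """Weighted sum of both rewards; the weight is an arbitrary assumption."""
    return accuracy_reward(completion, reference) + format_weight * format_reward(completion)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    Assumption: population standard deviation; eps avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt ("What is 2 + 2?"), a group of three sampled completions.
group = [
    "<think>2 + 2 is 4</think> 4",  # correct and well formatted
    "<think>a guess</think> 5",     # well formatted but wrong
    "4",                            # right number, but no <think> block
]
rewards = [total_reward(c, reference="4") for c in group]  # [1.5, 0.5, 0.0]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the group, no separate learned value network is needed in this sketch; the completion that is both correct and well formatted receives a positive advantage and the other two receive negative ones.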
One approach to reinforcement learning involves generative and discriminative models, such as GAN. Typical high-level AI development follows this approach and requires automation. While images can be compared visually, it's much harder to evaluate text, code, and audio. Therefore, a good AI coding assistant should not just provide results, but should help by breaking tasks down into smaller, easily verifiable steps. In other words, the importance of verifiability aligns with Verifiable Reward, suggesting that larger units like code blocks or video clips should be gradually incorporated.
Verifiable Rewards

Deepseek R1

However, it has low sample efficiency at larger sample counts and only finds better reasoning paths within the model's existing capacity, which makes its total problem-solving coverage smaller.
