Verifiable Reward

Creator
Seonglae Cho
Created
2025 Mar 21 11:54
Editor
Edited
2025 Apr 30 23:58

RLVR (Reinforcement Learning with Verifiable Rewards)

Programmatic Reward, Verifiable Feedback

Externally provided feedback that can be objectively validated against specific criteria.

Used for reasoning data with GRPO
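
As a rough illustration of how GRPO consumes such rewards, the sketch below normalizes the rewards of a group of completions sampled for the same prompt into group-relative advantages. The function name `group_relative_advantages`, the `eps` term, and the use of the population standard deviation are assumptions made for this example; implementations differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    Using the population std and this eps value is an assumption of the sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for one prompt, scored 1.0 (correct) or 0.0 (wrong).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct completions end up with positive advantages and incorrect ones with negative advantages. If every completion in a group receives the same reward, all advantages are zero and that group contributes no learning signal.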

Rule-based verifiable reward: the LLM itself is the policy network.
  • Accuracy rewards: checking whether the model’s final answer is correct
  • Format rewards: incentivizing a structured chain-of-thought (enclosing CoT tokens in <think> and </think>)
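
The two rule-based rewards above could be sketched roughly as follows. This is a minimal example under stated assumptions, not a reference implementation: the helper names, the exact-string answer comparison, and the `format_weight` value are all hypothetical, and a real checker would typically use a task-specific verifier (for example a math answer parser or unit tests).

```python
import re

# Assumed layout: reasoning inside <think>...</think>, then the final answer.
THINK_PATTERN = re.compile(r"^<think>(.+?)</think>\s*(.+)$", re.DOTALL)


def format_reward(completion: str) -> float:
    """Return 1.0 if the completion follows the <think>...</think> + answer layout."""
    return 1.0 if THINK_PATTERN.match(completion.strip()) else 0.0


def extract_answer(completion: str) -> str:
    """Take the text after the last </think> tag (or the whole text if absent)."""
    return completion.rsplit("</think>", 1)[-1].strip()


def accuracy_reward(completion: str, reference: str) -> float:
    """Return 1.0 if the final answer matches the reference exactly (an assumption)."""
    return 1.0 if extract_answer(completion) == reference.strip() else 0.0


def total_reward(completion: str, reference: str, format_weight: float = 0.5) -> float:
    """Combine both signals; the weighting is hypothetical."""
    return accuracy_reward(completion, reference) + format_weight * format_reward(completion)


score = total_reward("<think>2 + 2 = 4</think> 4", "4")
```

Because both checks are deterministic rules, no learned reward model is needed, which is what makes the reward verifiable.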
However, RLVR has low sample efficiency at larger sample counts, and it only finds better reasoning paths within the model's existing capacity, which makes its total problem-solving coverage smaller.
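
One common way to quantify coverage is pass@k, the probability that at least one of k sampled completions solves a problem. The note does not name a metric, so treating pass@k as the coverage measure is an assumption of this sketch. It uses the standard unbiased estimator computed from n samples, of which c are correct.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: completions sampled for a problem
    c: how many of them were verified as correct
    k: sample budget being evaluated (k <= n)
    """
    if k > n:
        raise ValueError("k must not exceed n")
    # math.comb returns 0 when k > n - c, so the estimate becomes 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


estimate = pass_at_k(n=10, c=5, k=1)
```

Averaging this estimate over a benchmark at increasing k gives a coverage curve that can be compared between a base model and its RLVR-trained counterpart.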
 
 
