Verifiable Reward

Creator
Seonglae Cho
Created
2025 Mar 21 11:54
Edited
2025 Jun 6 16:26

RLVR

Programmatic Reward, Verifiable Feedback

Externally provided feedback that can be objectively validated against specific criteria.

For reasoning data with GRPO

Rule-based verifiable reward, where the LLM itself is the policy network:
  • Accuracy rewards: checking whether the model's final answer is correct
  • Format rewards: incentivizing a structured chain of thought (enclosing CoT tokens in <think> and </think> tags)
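The two rule-based rewards above can be sketched roughly as follows. This is a minimal illustration, not the implementation of any particular system: the function names, the exact-match answer check, and the `format_weight` value are all assumptions made for the example. The last helper shows the group-relative normalization commonly associated with GRPO, where each reward is standardized against the other completions sampled for the same prompt.

```python
import re
from statistics import mean, pstdev

# Assumed output layout: reasoning inside <think>...</think>, then the final answer.
THINK_PATTERN = re.compile(r"^\s*<think>(.+?)</think>\s*(.+?)\s*$", re.DOTALL)


def format_reward(completion: str) -> float:
    """Return 1.0 if the completion is '<think>...</think>' followed by a non-empty answer."""
    return 1.0 if THINK_PATTERN.match(completion) else 0.0


def accuracy_reward(completion: str, reference: str) -> float:
    """Return 1.0 if the text after </think> exactly matches the reference answer.

    Exact string match is an assumption; real checkers often normalize
    numbers, run unit tests, or compare symbolic expressions instead.
    """
    match = THINK_PATTERN.match(completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(2).strip() == reference.strip() else 0.0


def total_reward(completion: str, reference: str, format_weight: float = 0.5) -> float:
    """Combine both signals; the 0.5 weight is an arbitrary illustrative choice."""
    return accuracy_reward(completion, reference) + format_weight * format_reward(completion)


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    completions = [
        "<think>2 + 2 = 4</think>4",  # correct and well formatted
        "<think>2 + 2 = 5</think>5",  # well formatted but wrong
        "4",                          # missing the <think> block
    ]
    rewards = [total_reward(c, "4") for c in completions]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Because both checks are plain rules, the reward needs no learned reward model, and any completion can be re-scored deterministically.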
One approach to reinforcement learning pairs generative and discriminative models, as in a GAN. Typical high-level AI development follows this approach and requires automation. While images can be compared visually, it is much harder to evaluate text, code, and audio. Therefore, a good AI coding assistant should not just provide results, but should help by breaking tasks down into smaller, easily verifiable steps. In other words, the importance of verifiability aligns with Verifiable Reward, suggesting that larger units like code blocks or video clips should be incorporated gradually.
 
 
 
However, it has low sample efficiency with larger samples and only finds better reasoning paths within the model's existing capacity, which makes its total problem-solving coverage smaller.
 
 
