RLVR (Reinforcement Learning with Verifiable Rewards)
Programmatic Reward, Verifiable Feedback
Externally provided feedback that can be objectively validated against specific criteria.
For Reasoning Data with GRPO
Rule-based verifiable rewards; the LLM itself serves as the policy network, with no learned reward model. Two common reward types (see the sketch after this list):
- Accuracy rewards: Checking if the model’s final answer is correct
- Format rewards: incentivizing a structured chain-of-thought (enclosing the CoT tokens in <think> and </think> tags)
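A minimal sketch, in Python with hypothetical helper names, of how these rule-based rewards can plug into a GRPO-style update: score each sampled response with an accuracy check and a format check, then normalize rewards within the group to get advantages. The exact-string answer match is an assumption for illustration; real verifiers (math checkers, unit tests) are usually more involved.

```python
import re


def accuracy_reward(response: str, gold_answer: str) -> float:
    """1.0 if the final answer matches the reference, else 0.0."""
    # Assumed convention: the answer appears after the </think> block.
    answer = response.split("</think>")[-1].strip().lower()
    return 1.0 if answer == gold_answer.strip().lower() else 0.0


def format_reward(response: str) -> float:
    """1.0 if the chain of thought is enclosed in <think>...</think>."""
    return 1.0 if re.search(r"<think>.*?</think>", response, re.DOTALL) else 0.0


def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages: normalize rewards within one group of
    responses sampled for the same prompt (no value network needed)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (var ** 0.5 + eps) for r in rewards]


# Example: score a group of sampled responses to one prompt.
group = [
    "<think>2 + 2, carry nothing</think> 4",
    "<think>guess</think> 5",
    "4",  # correct answer, but missing the <think> tags
]
rewards = [accuracy_reward(r, "4") + format_reward(r) for r in group]
print(rewards, grpo_advantages(rewards))
```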
Limitations of Verifiable Rewards
However, RLVR has low sample efficiency at larger sample budgets, and it mostly finds better reasoning paths within the base model's existing capacity rather than discovering new ones, which keeps its total problem-solving coverage smaller.
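The coverage claim is typically measured with pass@k. Below is a minimal sketch of the standard unbiased pass@k estimator (n samples per problem, c of them correct); the sample counts are made-up illustrations of the usual pattern that an RLVR-tuned model wins at small k while the more diverse base model can catch up at large k.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    (drawn from n total, c of which are correct) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


n = 256  # illustrative sample budget per problem
print(pass_at_k(n, c=64, k=1))    # RLVR-tuned: higher pass@1
print(pass_at_k(n, c=16, k=1))    # base model: lower pass@1
print(pass_at_k(n, c=16, k=128))  # base model at large k: near 1.0
```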