RLVR
Programmatic Reward, Verifiable Feedback
Externally provided feedback that can be objectively validated against specific criteria.
For Reasoning Data with GRPO
Rule-based verifiable rewards; the LLM itself is the policy network.
- Accuracy rewards: checking whether the model's final answer is correct
- Format rewards: incentivizing a structured chain-of-thought (enclosing CoT tokens between `<think>` and `</think>`)
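The two reward types above can be sketched as simple rule-based checks. This is a minimal illustration, not the exact implementation used by any particular system: the exact-match accuracy check and the regex for the format check are assumptions.

```python
import re

def format_reward(completion: str) -> float:
    """1.0 if the chain-of-thought is enclosed in <think>...</think>, else 0.0."""
    return 1.0 if re.search(r"<think>.*?</think>", completion, re.DOTALL) else 0.0

def accuracy_reward(completion: str, gold_answer: str) -> float:
    """1.0 if the final answer (the text after </think>) matches the reference.
    Real systems use more robust matching (numeric parsing, normalization)."""
    answer = completion.split("</think>")[-1].strip()
    return 1.0 if answer == gold_answer else 0.0

def rule_based_reward(completion: str, gold_answer: str) -> float:
    # Total reward is the sum of the accuracy and format components.
    return accuracy_reward(completion, gold_answer) + format_reward(completion)
```

For example, `rule_based_reward("<think>2+2=4</think>4", "4")` scores 2.0 (correct and well-formatted), while a bare `"4"` scores only 1.0 because the `<think>` tags are missing.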
One lineage of this idea pairs a generative model with a discriminative one, as in GANs, and high-level AI development along these lines requires the feedback loop to be automated. Images can be compared visually, but text, code, and audio are much harder to evaluate. A good AI coding assistant should therefore not just produce a final result, but break tasks down into smaller, easily verifiable steps. In other words, the importance of verifiability aligns with the notion of a Verifiable Reward, and larger units such as code blocks or video clips can be incorporated gradually as they become verifiable.
However, it has low sample efficiency at larger sample sizes, and it only finds better reasoning paths within the model's existing capacity, which keeps its total problem-solving coverage small.