RLVR (Reinforcement Learning with Verifiable Rewards)
Programmatic Reward, Verifiable Feedback
Externally provided feedback that can be objectively validated against specific criteria.
For Reasoning Data with GRPO
Rule-based verifiable rewards; the LLM itself serves as the policy network, with no learned reward model. Two common reward types (see the sketch after this list):
- Accuracy rewards: Checking if the model’s final answer is correct
- Format rewards: incentivizing a structured chain-of-thought (enclosing the CoT tokens in <think> and </think> tags)
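A minimal sketch, in Python with hypothetical helper names, of how these rule-based rewards can plug into a GRPO-style update: score each sampled response with an accuracy check and a format check, then normalize rewards within the group to get advantages. The exact-string answer match is an assumption for illustration; real verifiers (math checkers, unit tests) are usually more involved.

```python
import re


def accuracy_reward(response: str, gold_answer: str) -> float:
    """1.0 if the final answer matches the reference, else 0.0."""
    # Assumed convention: the answer appears after the </think> block.
    answer = response.split("</think>")[-1].strip().lower()
    return 1.0 if answer == gold_answer.strip().lower() else 0.0


def format_reward(response: str) -> float:
    """1.0 if the chain of thought is enclosed in <think>...</think>."""
    return 1.0 if re.search(r"<think>.*?</think>", response, re.DOTALL) else 0.0


def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages: normalize rewards within one group of
    responses sampled for the same prompt (no value network needed)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (var ** 0.5 + eps) for r in rewards]


# Example: score a group of sampled responses to one prompt.
group = [
    "<think>2 + 2, carry nothing</think> 4",
    "<think>guess</think> 5",
    "4",  # correct answer, but missing the <think> tags
]
rewards = [accuracy_reward(r, "4") + format_reward(r) for r in group]
print(rewards, grpo_advantages(rewards))
```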
Limitations of Verifiable Rewards
However, RLVR has low sample efficiency at larger sample budgets, and it mostly finds better reasoning paths within the base model's existing capacity rather than discovering new ones, which keeps its total problem-solving coverage smaller.
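The coverage claim is typically measured with pass@k. Below is a minimal sketch of the standard unbiased pass@k estimator (n samples per problem, c of them correct); the sample counts are made-up illustrations of the usual pattern that an RLVR-tuned model wins at small k while the more diverse base model can catch up at large k.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    (drawn from n total, c of which are correct) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


n = 256  # illustrative sample budget per problem
print(pass_at_k(n, c=64, k=1))    # RLVR-tuned: higher pass@1
print(pass_at_k(n, c=16, k=1))    # base model: lower pass@1
print(pass_at_k(n, c=16, k=128))  # base model at large k: near 1.0
```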