RLVR
Programmatic Reward, Verifiable Feedback
Externally provided feedback that can be objectively validated against specific criteria.
Humans do not consider all reasoning paths in parallel, nor do they give bonus points to every intermediate step just because the final answer was correct. They regret steps that were strange and give extra credit to steps that were helpful. In other words, a reflection process that synthesizes multiple trials is also necessary.
- ORM (Outcome Reward Model): gives a reward based only on the final output.
- PRM (Process Reward Model): evaluates and rewards each intermediate step (the contrast is sketched below).
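A rough sketch of the contrast; the verifier functions here are hypothetical stand-ins, not any particular implementation:

```python
from typing import Callable, List

# Hypothetical verifiers: check_final_answer scores only the outcome,
# while check_step scores a single intermediate reasoning step.
check_final_answer: Callable[[str], bool] = lambda answer: answer.strip() == "42"
check_step: Callable[[str], float] = lambda step: 1.0 if "therefore" in step else 0.0

def orm_reward(steps: List[str], final_answer: str) -> float:
    """ORM: one sparse scalar reward based only on the final output."""
    return 1.0 if check_final_answer(final_answer) else 0.0

def prm_rewards(steps: List[str]) -> List[float]:
    """PRM: one reward per step, giving dense credit assignment."""
    return [check_step(s) for s in steps]
```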
For Reasoning Data with GRPO
Rule-based verifiable rewards, with the LLM itself as the policy network:
- Accuracy rewards: checking whether the model's final answer is correct
- Format rewards: incentivizing a structured chain of thought by enclosing CoT tokens in <think> and </think> tags (see the sketch below)
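A minimal sketch of such a rule-based reward together with GRPO's group-relative advantage; the exact-match check, the 0.5 format weight, and the function names are illustrative assumptions, not DeepSeek's actual implementation:

```python
import re
from statistics import mean, pstdev
from typing import List

THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def verifiable_reward(completion: str, gold_answer: str) -> float:
    """Rule-based reward: accuracy term + format term, no learned reward model."""
    # Format reward: chain of thought must be enclosed in <think>...</think>.
    format_reward = 0.5 if THINK_RE.search(completion) else 0.0
    # Accuracy reward: exact-match check of what remains after stripping the CoT.
    final_answer = THINK_RE.sub("", completion).strip()
    accuracy_reward = 1.0 if final_answer == gold_answer else 0.0
    return accuracy_reward + format_reward

def grpo_advantages(rewards: List[float]) -> List[float]:
    """GRPO: normalize rewards within a group of completions sampled for the
    same prompt, replacing a learned value network with group statistics."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-6) for r in rewards]
```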
One framing of reinforcement learning pairs a generator with a discriminator, as in GANs: producing an output is hard, but verifying one can be much easier. Much of AI development follows this generate-then-verify loop and benefits from automating the verification side. Images can be compared at a glance, but text, code, and audio are far harder to evaluate. A good AI coding assistant should therefore not just hand over results, but help by breaking tasks down into smaller, easily verifiable steps. This emphasis on verifiability is exactly what Verifiable Rewards capture, and it suggests gradually scaling verification up to larger units such as whole code blocks or video clips.
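For code specifically, one concrete verifiable signal is unit-test execution. The sketch below (an assumed helper, with all-or-nothing scoring chosen for simplicity) runs a candidate solution against tests and rewards a clean pass:

```python
import os
import subprocess
import sys
import tempfile

def code_reward(solution: str, tests: str, timeout: float = 5.0) -> float:
    """Verifiable reward for code: run the candidate against unit tests
    in a subprocess and return 1.0 on a clean pass, 0.0 otherwise."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution + "\n\n" + tests)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0
    finally:
        os.remove(path)
```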
Verifiable Rewards
DeepSeek-R1
However, it has low sample efficiency at larger sample budgets, and it only finds better reasoning paths within the base model's existing capacity, which makes its total problem-solving coverage smaller.
