Process Reinforcement through Implicit Rewards

Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, Ning Ding
: Project lead
: Core contributors
  • We present PRIME (Process Reinforcement through IMplicit REwards), an open-source solution for online RL with process rewards, to advance reasoning abilities of language models beyond imitation or distillation.
  • With PRIME, starting from Qwen2.5-Math-7B-Base, our trained model Eurus-2-7B-PRIME achieves 26.7% pass@1 on AIME 2024, surpassing GPT-4o and Qwen2.5-Math-7B-Instruct. We achieve this with only 1/10 of the data used by Qwen Math (230K SFT + 150K RL).
  • We also explore inference-time scaling and train EurusPRM, a SOTA-level math PRM that pushes the boundary even further.
  • Work in Progress. All models and data released. Code coming soon!
Tell me and I forget, teach me and I remember, involve me and I learn. — Benjamin Franklin

Introduction

Our Eurus-2-7B-PRIME excels at competition-level mathematics benchmarks, outperforming advanced math models and larger models. Notably, PRIME brings substantial performance gain (+16.7%) for Eurus-2-7B-SFT.
While the advanced reasoning of large language models (LLMs) can be improved through data-driven imitation, this approach creates fundamental scalability barriers: better reasoning requires exponentially more high-quality examples to imitate, making continuous improvement increasingly intractable. We believe the key to overcoming such challenges lies in transforming data-driven approaches into exploration-based methods, as exemplified by reinforcement learning (RL). To this end, two critical challenges need to be addressed: (1) how can we obtain precise reward signals efficiently and scalably, especially dense ones? (2) how can we build effective RL algorithms that fully unleash the potential of these signals?
In this blog, we seek the scalable path towards advanced reasoning capabilities with efficient reward modeling and reinforcement learning.
Our recent study presented the implicit process reward modeling (PRM) objective. Without the need for any process label, implicit PRM is trained as an outcome reward model (ORM) and then used as a PRM. Inspired by this captivating property, we find that besides improving model performance through inference scaling, the true power of the implicit PRM is unveiled in online RL training. Specifically, it brings three benefits to RL:
  • Dense Reward: Implicit PRM directly learns a Q-function that provides rewards for each token, which alleviates the reward sparsity issue without the need of an extra value model.
  • Scalability: Implicit PRM can be updated online with only outcome labels. Therefore, we can directly update the PRM with on-policy rollouts given outcome verifiers, which mitigates the distribution shift as well as scalability issues for PRMs.
  • Simplicity: Implicit PRM is inherently a language model. In practice, we show that it is unnecessary to train a PRM beforehand, since the SFT model itself already serves as a strong starting point.
We then dive into RL to figure out its key algorithm designs and implementation techniques. To this end, we present Process Reinforcement through IMplicit rEwards, PRIME, which effectively incorporates and updates PRMs in RL.
As an intermediate result, through PRIME, we achieve substantial improvements on key reasoning benchmarks over our SFT version of the model, leading to a 16.7% improvement on average, and over 20% on the AMC and AIME competitions. Our final model Eurus-2-7B-PRIME, based on Qwen-2.5-Math-7B-Base, surpasses its instruct version on 5 key reasoning benchmarks. We then train a PRM with the implicit PRM objective for inference-time scaling, which further boosts the model's reasoning capability.
The evaluation results of the opening figure are detailed below:
| Benchmark | Eurus-2-7B-PRIME | Eurus-2-7B-SFT | Qwen-2.5-Math-7B-Instruct | Llama-3.1-70B-Instruct | GPT-4o |
| --- | --- | --- | --- | --- | --- |
| AIME 2024 | 26.7 (+23.3) | 3.3 | 13.3 | 16.73 | 9.3 |
| MATH-500 | 79.2 (+14.1) | 65.1 | 79.8 | 64.6 | 76.4 |
| AMC | 57.8 (+27.7) | 30.1 | 50.6 | 30.1 | 45.8 |
| Minerva Math | 38.6 (+5.9) | 32.7 | 34.6 | 35.3 | 36.8 |
| OlympiadBench | 42.1 (+12.3) | 29.8 | 40.7 | 31.9 | 43.3 |
| Avg. | 48.9 (+16.7) | 32.2 | 43.8 | 35.7 | 43.3 |
We achieve this with only 1/10 data resources compared with Qwen-Math. The following is a comparison of resource requirements between Eurus-2-7B-PRIME and Qwen2.5-Math-7B-Instruct.
| | Eurus-2-7B-PRIME | Qwen2.5-Math-7B-Instruct |
| --- | --- | --- |
| Base Model | Qwen2.5-Math-7B | Qwen2.5-Math-7B |
| SFT Data | 230K (open-source) | 2.5M (open-source and in-house) |
| RM Data | 0 | 618K (in-house) |
| RM | Eurus-2-7B-SFT | Qwen2.5-Math-RM (72B) |
| RL Data | 150K queries × 4 samples | 66K queries × 32 samples |
This blog will introduce:
  • The implicit process reward modeling objective and why it’s advantageous for PRM&RL
  • The PRIME algorithm which incorporates implicit process reward into online RL
  • The full recipe to build a strong reasoning model Eurus-2-7B-PRIME
  • How we further enhanced its performance by inference-time scaling with EurusPRM
We release all the models and data used in this research.
  • SFT Data
  • SFT Model
  • PRM Data
  • PRM
  • RL Data
  • PRIME Model

Preparation & Imitation Warmup

Models and Evaluation Datasets

We select Qwen2.5-Math-7B-Base as the starting point for its great mathematical capabilities.
For evaluation, we primarily adopt competition-level mathematics and programming benchmarks, as well as several commonly used datasets, including AIME 2024, AMC, MATH-500, Minerva Math, OlympiadBench, LeetCode and LiveCodeBench(v2).

Imitation Learning

We first performed supervised finetuning on the base model to get a starter model for RL.

Action-centric chain-of-thought reasoning

We applied imitation learning (supervised finetuning) as a warmup stage to teach models to learn certain reasoning patterns. To this end, we first designed an action-centric chain-of-thought reasoning framework, where the policy model chooses one of 7 actions at each step and stops after executing each action.

SFT dataset construction

To construct the SFT dataset, we collected reasoning instructions from several open-source datasets. It is noteworthy that we did not include many datasets with ground-truth answers in SFT even though they are of higher quality, but reserved them for the later RL training. The reason is that we aim to use different datasets for SFT and RL to diversify the exploration in RL, and we consider ground-truth more essential in RL than in SFT. For completion, we employ LLaMA-3.1-70B-Instruct to answer the instructions, with a system prompt requesting the model to perform action-centric chain-of-thought.
We finally obtained 230K SFT examples. The detailed sources and statistics can be found in the Appendix.

SFT results

After finetuning, the performance of our SFT model is reported in the opening figure.
Our SFT model lags behind Qwen2.5-Math-7B-Instruct on all mathematics benchmarks.

Process Reward Models

Implicit PRM: Free Process Rewards without Process Labels

We adopt Implicit PRM, which obtains process rewards at no additional cost: it only requires training an ORM on the cheaper response-level labels. During inference, implicit process rewards are obtained with a forward pass that computes the log-likelihood ratio at each step.
The key ingredient of Implicit PRM is the reward representation, as demonstrated below:
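Proposition: Consider an ORM where the reward is parameterized by the log-likelihood ratio of two causal LMs, i.e. $r_\phi(\mathbf{y}) := \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}$. Define $q_\phi^t(\mathbf{y}_{<t}, y_t) := \sum_{i=1}^{t} \beta \log \frac{\pi_\phi(y_i \mid \mathbf{y}_{<i})}{\pi_\text{ref}(y_i \mid \mathbf{y}_{<i})}$. Then $q_\phi^t$ is the exponential average of $r_\phi$ at step $t$:
$$q_\phi^t(\mathbf{y}_{<t}, y_t) = \beta \log \mathbb{E}_{\pi_\text{ref}(\mathbf{y} \mid \mathbf{y}_{\le t})} \left[ e^{\frac{1}{\beta} r_\phi(\mathbf{y})} \right]$$
Hence, $q_\phi^t$ represents an exact expectation of the outcome reward $r_\phi$ at step $t$, i.e., the Q value.
The proposition indicates that when modeling $r_\phi(\mathbf{y}) := \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}$ to train an ORM with the standard pipeline, where $\beta$ is a hyperparameter, $\phi$ can implicitly learn a Q function. Hence, the process reward $r_\phi^t$ can be obtained by:
$$r_\phi^t := q_\phi^t - q_\phi^{t-1} = \beta \log \frac{\pi_\phi(y_t \mid \mathbf{y}_{<t})}{\pi_\text{ref}(y_t \mid \mathbf{y}_{<t})}$$
Therefore, we can indeed obtain PRMs simply by collecting response-level data and training an ORM, without any burden of annotating step labels.

The proposition is agnostic to the specific choice of ORM training objective. It can be instantiated with the same objectives as vanilla ORM training, with the only difference being substituting $r_\phi(\mathbf{y})$ with $\beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}$. For example, DPO already meets our assumption and serves as a strong variant, while in this work, we instantiate our implicit PRM with cross entropy (CE) loss due to memory efficiency:
$$\mathcal{L}_{CE} = -\left[ l \log \sigma\left( \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})} \right) + (1 - l) \log \left( 1 - \sigma\left( \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})} \right) \right) \right]$$
where $l \in \{0, 1\}$ is the response-level label.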
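As a rough illustration of the log-likelihood-ratio reward described above, the sketch below computes per-token implicit process rewards from token log-probabilities and evaluates a CE loss on a response-level label. The function names, the toy log-probabilities, and the value of `beta` are assumptions made for illustration; this is not the released implementation.

```python
import numpy as np

def implicit_process_rewards(policy_logprobs, ref_logprobs, beta):
    """Per-token reward: beta * (log pi_phi(y_t | y_<t) - log pi_ref(y_t | y_<t))."""
    policy_logprobs = np.asarray(policy_logprobs, dtype=float)
    ref_logprobs = np.asarray(ref_logprobs, dtype=float)
    return beta * (policy_logprobs - ref_logprobs)

def outcome_reward(policy_logprobs, ref_logprobs, beta):
    """Response-level reward: the sum of per-token rewards (sequence log-ratio)."""
    return float(implicit_process_rewards(policy_logprobs, ref_logprobs, beta).sum())

def ce_loss(reward, label):
    """Binary cross entropy with the outcome reward used as the logit."""
    p = 1.0 / (1.0 + np.exp(-reward))
    return float(-(label * np.log(p) + (1 - label) * np.log(1.0 - p)))

# Toy token log-probabilities for a 3-token response (made-up numbers).
policy_lp = [-0.5, -1.0, -0.2]
ref_lp = [-0.7, -0.9, -0.6]
beta = 0.5  # illustrative value only

step_rewards = implicit_process_rewards(policy_lp, ref_lp, beta)
q_values = np.cumsum(step_rewards)  # running sum of step rewards up to each token
loss = ce_loss(outcome_reward(policy_lp, ref_lp, beta), label=1)
```

Because the response-level reward is just the sum of the token-level terms, training on outcome labels alone still shapes every per-token reward.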

Reinforcement Learning

Our goal is clear and focused: to extensively leverage reinforcement learning (RL) to enhance reasoning capabilities. Aiming at the best practices of such a paradigm with limited resources, our key insights can be summarized below:

Pilot Study on Algorithms and Data

RL Data Collection & Preprocessing

We curated a high-quality RL training dataset of mathematics and coding problems with outcome verifiers (LaTeX answers for math and test cases for coding).
  • For math, we sourced from NuminaMath-CoT, which contains about 860K math problems. The problems span from Chinese high school mathematics to International Mathematical Olympiad competition questions.
To further increase data quality, we conducted detailed cleaning and filtering. Detailed data preprocessing can be found in Appendix. Finally, we retain 457k math problems and 27k coding problems.

Online Prompt Filtering

During the rollout stage, we find that choosing appropriate prompts matters a lot, in particular preserving only the prompts within a certain difficulty range. Inspired by Qwen-2.5-Math, which filtered prompts beforehand according to the accuracy of the initial policy model, we perform online prompt filtering throughout training. We sample multiple trajectories for each prompt, then calculate the accuracy and preserve the prompts whose accuracy falls within a certain range. This also balances the training data distribution for the PRM update.
We conducted experiments validating this prompt filtering strategy. We sampled 4 trajectories for each prompt and kept prompts with accuracy between 0.2 and 0.8, which means we discard prompts that are either too easy or too hard. We plot the training rewards in the figure below.
[Figure: training rewards with and without online prompt filtering]
From the results, we can see that online prompt filtering substantially lowers the variance of RL training.
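The filtering step can be sketched in a few lines. This is a minimal illustration that assumes verifier outcomes are already available as 0/1 values per trajectory and that the 0.2 to 0.8 accuracy bounds are inclusive; `filter_prompts` and the prompt ids are hypothetical names.

```python
def filter_prompts(rollout_correct, low=0.2, high=0.8):
    """Keep prompts whose sampled accuracy lies within [low, high].

    rollout_correct maps a prompt id to a list of 0/1 verifier outcomes,
    one per sampled trajectory.
    """
    kept = {}
    for prompt_id, outcomes in rollout_correct.items():
        accuracy = sum(outcomes) / len(outcomes)
        if low <= accuracy <= high:
            kept[prompt_id] = accuracy
    return kept

# Four trajectories per prompt, as in the experiment described above.
rollouts = {
    "too_easy": [1, 1, 1, 1],
    "too_hard": [0, 0, 0, 0],
    "useful_a": [1, 0, 0, 0],
    "useful_b": [1, 1, 1, 0],
}
kept = filter_prompts(rollouts)
```

With 4 samples per prompt, this keeps exactly the prompts where the policy gets 1 to 3 trajectories right, so every retained prompt contributes both positive and negative examples.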

RL Algorithms

We compared different online RL algorithms including PPO, REINFORCE, RLOO, GRPO, and ReMax. We implemented them with veRL and conducted pilot experiments with outcome verifiers as rewards. Specifically, the ground truth outcome rewards are given by the outcome verifiers: answer matching for math and test cases for coding.
For these preliminary experiments, we began training with a fine-tuned Llama-3.1-8B model and report the results in the Appendix. We find that REINFORCE-like algorithms, despite being simpler than PPO, are strong enough to produce stable results. We choose the best-performing RLOO as our RL algorithm. Note that we only adopt the advantage/return estimation function of RLOO, and use the PPO policy loss with importance sampling and value clipping for training stability.

PRIME: Reinforcement Learning with PRM

Integrating PRMs into (online) reinforcement learning is not trivial, and poses several critical challenges to solve. Here we present the key challenges and how we solved them with Implicit PRM.

PRIME Algorithm

We describe our final algorithm in this section. First, we illustrate the full cycle of PRIME with animation.
[Animation: the full cycle of PRIME]
The policy model and PRM are both initialized with the SFT model. For each RL iteration, the policy model first generates rollouts. Then, the implicit PRM and outcome verifier score the rollouts, and the implicit PRM gets updated on the rollouts with the outcome reward. Finally, the outcome reward and process reward are combined and used to update the policy model.

Implementation

We present pseudo code here:
[Figure: PRIME pseudocode]
The algorithm flow includes:
  1. Prompt filtering based on policy model performance, only preserving those on which the policy model achieves an accuracy between 0.2 and 0.8.
  2. Calculate the implicit process reward $r_\phi^t$.
  3. Update the Implicit PRM based on the predicted implicit process reward $r_\phi^t$ and the ground truth outcome label $r$.
  4. Advantage estimation with RLOO. Specifically, we first calculate the return of outcome rewards and implicit process rewards separately:
     - For ground truth outcome rewards, we directly adopt RLOO without any modification.
     - For implicit process rewards, we perform a three-step process to calculate the return: (1) use the averaged implicit process rewards to calculate the leave-one-out baseline; (2) normalize the process reward at step $t$ by subtracting the baseline; (3) calculate the discounted return for each response.
     Finally, the advantage is set to the combination of both returns.
  5. Update the policy using the PPO loss for proper importance sampling.
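The advantage estimation step can be sketched as follows. This is a hedged illustration rather than the released code: it assumes the "averaged" process reward is the mean over a response's tokens, that the two returns are combined by summation, and that the discount factor `gamma` defaults to 1. The function names are hypothetical.

```python
import numpy as np

def rloo_baseline(values):
    """Leave-one-out mean: for each i, the mean of all other entries."""
    values = np.asarray(values, dtype=float)
    k = len(values)
    return (values.sum() - values) / (k - 1)

def prime_advantages(outcome_rewards, process_rewards, gamma=1.0):
    """Token-level advantages for K responses sampled from one prompt.

    outcome_rewards: length-K sequence of verifier scores.
    process_rewards: list of K sequences of per-token implicit rewards.
    """
    outcome_rewards = np.asarray(outcome_rewards, dtype=float)
    # Outcome part: plain RLOO, one scalar per response.
    outcome_adv = outcome_rewards - rloo_baseline(outcome_rewards)

    # Process part, step 1: leave-one-out baseline from per-response mean rewards.
    mean_process = np.array([np.mean(r) for r in process_rewards])
    baseline = rloo_baseline(mean_process)

    advantages = []
    for i, rewards in enumerate(process_rewards):
        # Step 2: subtract the baseline from each token's reward.
        centered = np.asarray(rewards, dtype=float) - baseline[i]
        # Step 3: discounted return-to-go over the remaining tokens.
        returns = np.zeros_like(centered)
        running = 0.0
        for t in reversed(range(len(centered))):
            running = centered[t] + gamma * running
            returns[t] = running
        # Combine: add the outcome term to every token (summation is an assumption).
        advantages.append(returns + outcome_adv[i])
    return advantages

# Two responses to the same prompt, with made-up rewards.
adv = prime_advantages([1.0, 0.0], [[0.2, 0.4], [0.0, -0.2]])
```

The leave-one-out baseline needs at least two responses per prompt, which the rollout setting of several samples per prompt satisfies.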

Experiments

Settings

By default, we initialize the implicit PRM with SFT model and retain the SFT model for reference logprobs. For hyperparameters, we use a constant 5e-7 learning rate together with AdamW optimizer for policy model, and use 1e-6 learning rate for PRM. Both policy and PRM use a mini batchsize of 256 and micro batchsize of 8. The rollout stage collects 256 prompts and samples 4 responses for each prompt. We set for PRM training. We set KL coefficient to 0 in all experiments.

Main Results

We first present the effect of dense rewards in reinforcement learning. Here we compare PRIME with RLOO w/ outcome verifier (OV) only, which means there are only ground truth outcome rewards for each trajectory. We trained this model for 240 steps. For PRIME, we use the same setting and trained the model for 592 steps. We plot the training rewards measured by the outcome verifier and the test accuracy in the following figures. Compared with sparse reward, PRIME accelerates RL training by 2.5× and improves the final rewards by 6.9%, with lower variance. On downstream tasks, PRIME also consistently outperforms the OV-only setup.
Training outcome rewards. For fair comparison, we cut the training steps at 240.
Test accuracy comparison.
We list detailed results below. We can see that at the same step 240, the model trained by PRIME is generally better than the model trained by outcome rewards alone, leading to a 4-point performance gap. PRIME could further enhance the model with more training steps.

Effect of Online PRM

We introduced online PRM, which updates with policy model rollouts and their corresponding verifier outcomes. Here we demonstrate the importance of online updates for PRMs. We compare two settings, where the online PRM is initialized by Eurus-2-7B-SFT and the offline PRM is EurusPRM-Stage1. From the figures below, we can see that the online PRM outperforms the offline PRM by a large margin on both training and test sets.
[Figures: online PRM vs. offline PRM on the training and test sets]

Effect of Reference Policy

We implement two variants of our algorithm to explore the effect of the reference policy, one using the initial SFT model as the reference model and the other using the running policy's old logprobs as the reference, as shown in the figure below. The left one (policy ref) simply adopts the old logprobs of the policy model as $\pi_\text{ref}$, while the right one (SFT ref) retains the initial SFT model for an additional calculation. We compare their performance in this section.
[Figure: policy ref (left) and SFT ref (right)]
From the training rewards and test accuracy, we find the two strategies are close, and they have pros and cons in different aspects: Policy ref only needs two models in RL training, while SFT ref requires one more reference model. On the other hand, KL divergence calculation is only allowed when the initial SFT model is retained.

Single-Forward v.s. Double-Forward

Since our implicit PRM is concurrently updated in training, for each rollout stage, we can update PRM before policy model and use the updated PRM to re-calculate the process rewards, which we call the double-forward setting. We investigate the impact of double-forward in both training and test phase. Our default setting applies single-forward, which uses process rewards from old PRMs. We plot PRM accuracy on rollouts and training rewards below.
Accordingly, we find that double-forward could increase PRM accuracy, but the training rewards remain close between the two methods.
We also compare the average test set accuracy of single-forward and double-forward. Their performance is also close. Since double-forward brings more computation overhead, we recommend the single-forward setting in practice.

Inference Scaling with Implicit PRM

Besides RL, implicit PRM can further scale inference-time computation through Best-of-N sampling. In this section, we present EurusPRM, a SOTA-level open-source PRM for Best-of-N sampling.

PRM Training

We introduce a two-stage training pipeline upon Qwen2.5-Math-7B-Instruct for EurusPRM. We collected instructions with ground truth and employ Qwen2.5-Math-7B-Base, Llama-3.1-8B-Base/Instruct, Llama-3.1-70B-Instruct, Qwen2.5-72B-Instruct, and our SFT model to sample rollouts. Training datasets statistics can be found in Appendix.
Stage 1: Training on Complete Response-level Rollouts
We applied the CE loss above to train the implicit PRM. We used a learning rate of 5e-7 and a batch size of 64 for training.
Stage 2: Training on Manufactured Partial Step-level Pairs
We started the second-stage training on top of the first-stage model with fine-grained step-level labels. To obtain step-level labels, we employed Llama-3.1-70B-Inst and Qwen2.5-72B-Inst to insert subtle errors into correct solutions. We also mixed response-level data in this stage. The model was continually trained with the CE loss, a learning rate of 5e-7, and a batch size of 64.

PRM Evaluation

Evaluation Base Model

We adopt Eurus-2-7B-SFT, Qwen2.5-7B-Instruct and Llama-3.1-70B-Instruct as generation models to evaluate the performance of our implicit PRM. For all models, we set the sampling temperature as 0.5, p of the top-p sampling as 1.

Best-of-N Sampling

We use Best-of-64 as our evaluation metric. The weighting methods are different for several PRMs below.
  • For EurusPRM-Stage 1, we use the minimum reward across all steps.
  • For EurusPRM-Stage 2, we use the accumulative rewards.
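To make the two weighting methods concrete, here is a small sketch of Best-of-N selection with a pluggable aggregation over step rewards. The step rewards are made-up numbers and `best_of_n` is a hypothetical helper, not part of the released code.

```python
def best_of_n(candidates, aggregate):
    """Return the index of the candidate with the highest aggregated reward.

    candidates: list of per-step reward lists, one list per sampled response.
    aggregate: function mapping a list of step rewards to one score.
    """
    scores = [aggregate(step_rewards) for step_rewards in candidates]
    return max(range(len(scores)), key=scores.__getitem__)

# Made-up step rewards for three sampled responses.
candidates = [
    [1.0, 1.0, -0.1],  # strong overall, but one step scores below zero
    [0.3, 0.2, 0.1],   # uniformly mediocre
    [0.5, 0.4, 0.6],   # no weak step
]
pick_by_min = best_of_n(candidates, min)  # minimum reward across steps
pick_by_sum = best_of_n(candidates, sum)  # cumulative reward
```

In this toy case the minimum picks the third response, which has no weak step, while the cumulative reward picks the first, so the choice of aggregation can change which response is selected.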
Eurus-2-7B-SFT
Llama-3.1-70B-Instruct
Qwen2.5-7B-Instruct
 

Appendix

SFT Data & Training Details

The SFT data statistics are as follows:

Training Details

The following hyperparameters were used during training:

RL Data Preprocessing

Data Filtering and Question-Type Classification

The preprocessing pipeline employs a systematic rule-based approach to filter and classify mathematical problems to create a high-quality dataset with solvable problems, appropriate difficulty levels, and correct solutions.
We exclude problems containing figures or diagrams since they require visual processing capabilities. We also remove proof questions due to difficulties in answer verification. The remaining problems are classified into question-answering, multiple-choice, or fill-in-the-blank questions based on specific patterns. Since fill-in-the-blank questions comprise fewer than 400 examples, compared with the much larger set of multiple-choice questions, we focus solely on multiple-choice questions for further processing.

Converting to Direct Question-Answer Format

We transform multiple-choice questions into a direct question-answer format through three sequential stages: rule-based filtering, LLM-based filtering, and LLM-based formatting.
We first identify and remove questions that inherently require multiple-choice options - specifically, those where comparing specific statements or properties is essential to the problem-solving process. These questions cannot be meaningfully converted to a direct question-answer format. The initial filtering employs simple rule-based pattern matching, searching for keywords like "following" and "statement" that typically indicate option-dependent problems.
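A keyword filter of this kind could look like the sketch below. Only the two keywords mentioned in the text are used; the regular expression, the function name, and the sample questions are assumptions for illustration, and the actual pattern set is not specified here.

```python
import re

# Keywords taken from the text above; a real pipeline would likely use more.
OPTION_DEPENDENT = re.compile(r"\b(following|statements?)\b", re.IGNORECASE)

def requires_options(question):
    """Heuristic: True if the wording suggests the options are part of the problem."""
    return bool(OPTION_DEPENDENT.search(question))

questions = [
    "Which of the following statements about prime numbers is true?",
    "What is the value of 3^4 - 2^5?",
]
convertible = [q for q in questions if not requires_options(q)]
```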
Following the rule-based filtering, we employ Llama-3.1-8B-Instruct to perform a more nuanced classification of the remaining questions. Our pilot study revealed that while the LLM occasionally misclassifies questions, it tends to err on the conservative side - marking potentially convertible questions as requiring options rather than the reverse. Given our large dataset, we accepted this conservative approach to maintain quality.
For questions classified as convertible, we implement a two-phase reformatting process:
  1. Question Reformatting: Removing choice indicators and restructuring the question to elicit direct answers
  1. Solution Reformatting: Converting multiple-choice solutions into step-by-step derivations, ensuring all final answers are presented in standard LaTeX boxed format
This systematic approach maintains mathematical rigor while creating a standardized format suitable for downstream applications.

Problem and Solution Validation

The final stage involves merging all question-answer pairs and performing LLM-based comprehensive validation. We identify two key aspects in validation: solvability and correctness.
We leverage state-of-the-art mathematical reasoning models, including QwQ-32B-Preview and Qwen2.5-Math-72B-Instruct, employing a self-consistency approach to determine problem solvability, and if solvable, verify the correctness of solutions provided in the original dataset.
To enhance validation accuracy, we first analyzed sample problems to identify characteristics of solvable and unsolvable cases and created synthetic unsolvable problems featuring missing conditions or logical contradictions. Based on these samples, we developed specialized prompts to improve the models' ability to distinguish solvability.
Each problem undergoes five independent validation attempts, where the LLM:
We evaluate two key consistency measures across multiple validation attempts:
  • Status Consistency: Agreement on problem solvability
  • Answer Consistency:
The final dataset retains only problems that demonstrate:
  • Consistent solvability across validation attempts
  • Agreement in solutions across multiple attempts
  • Alignment with ground truth answers
This rigorous validation process ensures the resulting dataset comprises well-defined, solvable problems with verified, accurate solutions.
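The retention rule can be sketched as a small function over the validation attempts. The agreement threshold, the exact string comparison of answers, and the name `validate_problem` are assumptions; the text does not state how strict the consistency checks are, and a real pipeline would need a math-aware answer comparison.

```python
from collections import Counter

def validate_problem(attempts, ground_truth, min_agreement=1.0):
    """Decide whether to keep a problem given several validation attempts.

    attempts: list of (is_solvable, answer) tuples, one per attempt.
    min_agreement: required fraction of attempts sharing the majority answer
        (hypothetical threshold; the source does not state one).
    """
    # Status consistency: every attempt must judge the problem solvable.
    if not all(solvable for solvable, _ in attempts):
        return False
    # Answer consistency: the most common answer must be frequent enough.
    answers = [answer for _, answer in attempts]
    top_answer, count = Counter(answers).most_common(1)[0]
    if count / len(answers) < min_agreement:
        return False
    # Alignment with the ground-truth answer from the original dataset.
    return top_answer == ground_truth

# Five attempts per problem, as described above.
consistent = [(True, "42")] * 5
disputed = [(True, "42")] * 3 + [(True, "41")] * 2
unsolvable = [(True, "42")] * 4 + [(False, None)]

results = [
    validate_problem(consistent, "42"),
    validate_problem(disputed, "42"),
    validate_problem(unsolvable, "42"),
    validate_problem(consistent, "7"),
]
```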

PRM Data

Stage 1

The dataset statistics of Stage 1 Training are listed below:

Stage 2

The dataset statistics of Stage 2 Training are listed below:

Other Results

Results of Different RL Algorithms

The results of different RL algorithms on Llama-3.1-8B are listed below. Since we used a different base model and dataset for the pilot study, the benchmarks used here are slightly different from the main experiments.

Citation

If you find PRIME or ImplicitPRM helpful, please cite them.
