Process Reinforcement through Implicit Rewards

Community Article Published January 3, 2025

Ganqu Cui†*, Lifan Yuan†*, Zefan Wang*, Hanbin Wang*, Wendi Li*, Bingxiang He*, Yuchen Fan*, Tianyu Yu*, Qixin Xu*, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, Ning Ding†

†: Project lead

*: Core contributors

GitHub: https://github.com/PRIME-RL/PRIME

In this blog post, we introduce PRIME (Process Reinforcement through IMplicit REwards), a scalable RL solution for advanced reasoning through implicit process rewards. Our main contributions:

  • We present PRIME (Process Reinforcement through IMplicit REwards), an open-source solution for online RL with process rewards, to advance reasoning abilities of language models beyond imitation or distillation.
  • With PRIME, starting from Qwen2.5-Math-7B-Base, our trained model Eurus-2-7B-PRIME achieves 26.7% pass@1 on AIME 2024, surpassing GPT-4o and Qwen2.5-Math-7B-Instruct. We achieve this with only 1/10 of the data used by Qwen Math (230K SFT + 150K RL).
  • We also explore inference-time scaling and train EurusPRM, a SOTA-level math PRM that pushes the boundary even further.
  • Work in Progress. All models and data released. Code coming soon!

Tell me and I forget, teach me and I remember, involve me and I learn.
— Benjamin Franklin

Introduction

Figure: Our Eurus-2-7B-PRIME excels at competition-level mathematics benchmarks, outperforming advanced math models and larger models. Notably, PRIME brings substantial performance gain (+16.7%) for Eurus-2-7B-SFT.

While the advanced reasoning of large language models (LLMs) can be improved through data-driven imitation, this approach creates fundamental scalability barriers: better reasoning requires exponentially more high-quality examples to imitate, making continuous improvement increasingly intractable. We believe the key to overcoming such challenges lies in transforming data-driven approaches into exploration-based methods, as exemplified by reinforcement learning (RL). To this end, two critical challenges need to be addressed: (1) how can we obtain precise reward signals, especially dense ones, efficiently and at scale? (2) how can we build effective RL algorithms that fully unleash the potential of these signals?

In this blog, we seek the scalable path towards advanced reasoning capabilities with efficient reward modeling and reinforcement learning.

Our recent study presented the implicit process reward modeling (PRM) objective. Without the need for any process label, implicit PRM is trained as an outcome reward model (ORM) and then used as a PRM. Inspired by this captivating property, we find that besides improving model performance through inference scaling, the true power of the implicit PRM is unveiled in online RL training. Specifically, it brings three benefits to RL:

  • Dense Reward: Implicit PRM directly learns a Q-function that provides rewards for each token, which alleviates the reward sparsity issue without the need for an extra value model.
  • Scalability: Implicit PRM can be updated online with only outcome labels. Therefore, we can directly update the PRM with on-policy rollouts given outcome verifiers, which mitigates both the distribution shift and the scalability issues of PRMs.
  • Simplicity: Implicit PRM is inherently a language model. In practice, we show that it is unnecessary to train a PRM beforehand, since the SFT model itself already serves as a strong starting point.

We then dive into RL to figure out its key algorithm designs and implementation techniques. To this end, we present Process Reinforcement through IMplicit REwards (PRIME), which effectively incorporates and updates PRMs in RL.

As an intermediate result, through PRIME, we achieve substantial improvements on key reasoning benchmarks over our SFT version of the model: a 16.7% improvement on average, and over 20% on the AMC and AIME competitions. Our final model Eurus-2-7B-PRIME, based on Qwen-2.5-Math-7B-Base, surpasses its instruct version on 5 key reasoning benchmarks. We then train a PRM with the implicit PRM objective for inference-time scaling, which further boosts the model's reasoning capability.

The evaluation results of the opening figure are detailed below:

| Benchmark | Eurus-2-7B-PRIME | Eurus-2-7B-SFT | Qwen-2.5-Math-7B-Instruct | Llama-3.1-70B-Instruct | GPT-4o |
|---|---|---|---|---|---|
| AIME 2024 | 26.7 (+23.3) | 3.3 | 13.3 | 16.7 | 9.3 |
| MATH-500 | 79.2 (+14.1) | 65.1 | 79.8 | 64.6 | 76.4 |
| AMC | 57.8 (+27.7) | 30.1 | 50.6 | 30.1 | 45.8 |
| Minerva Math | 38.6 (+5.9) | 32.7 | 34.6 | 35.3 | 36.8 |
| OlympiadBench | 42.1 (+12.3) | 29.8 | 40.7 | 31.9 | 43.3 |
| Avg. | 48.9 (+16.7) | 32.2 | 43.8 | 35.7 | 43.3 |

We achieve this with only 1/10 of the data resources used by Qwen-Math. The following is a comparison of resource requirements between Eurus-2-7B-PRIME and Qwen2.5-Math-7B-Instruct.

| | Eurus-2-7B-PRIME | Qwen2.5-Math-7B-Instruct |
|---|---|---|
| Base Model | Qwen2.5-Math-7B | Qwen2.5-Math-7B |
| SFT Data | 230K (open-source) | 2.5M (open-source and in-house) |
| RM Data | 0 | 618K (in-house) |
| RM | Eurus-2-7B-SFT | Qwen2.5-Math-RM (72B) |
| RL Data | 150K queries × 4 samples | 66K queries × 32 samples |

This blog will introduce:

  • The implicit process reward modeling objective and why it’s advantageous for PRM&RL
  • The PRIME algorithm which incorporates implicit process reward into online RL
  • The full recipe to build a strong reasoning model Eurus-2-7B-PRIME
  • How we further enhanced its performance by inference-time scaling with EurusPRM

We release all the models and data used in this research.

Preparation and Imitation Warmup

Models and Evaluation Datasets

We select Qwen2.5-Math-7B-Base as the starting point for its great mathematical capabilities.

For evaluation, we primarily adopt competition-level mathematics and programming benchmarks, as well as several commonly used datasets, including AIME 2024, AMC, MATH-500, Minerva Math, OlympiadBench, LeetCode and LiveCodeBench(v2).

Imitation Learning

We first performed supervised finetuning on the base model to get a starter model for RL.

Action-centric chain-of-thought reasoning

We applied imitation learning (supervised finetuning) as a warmup stage to teach models to learn certain reasoning patterns. To this end, we first designed an action-centric chain-of-thought reasoning framework, where the policy model chooses one of 7 actions at each step and stops after executing each action.

SFT dataset construction

To construct the SFT dataset, we collected reasoning instructions from several open-source datasets. It is noteworthy that we did not include many datasets with ground-truth answers in SFT even though they are of higher quality, but reserved them for the later RL training. The reason is that we aim to use different datasets for SFT and RL to diversify the exploration in RL, and we consider ground-truth more essential in RL than in SFT. For completion, we employ LLaMA-3.1-70B-Instruct to answer the instructions, with a system prompt requesting the model to perform action-centric chain-of-thought.

We finally obtained 230K SFT examples. The detailed sources and statistics can be found in the Appendix.

SFT results

After finetuning, the performance of our SFT model is reported in the opening figure.

Compared with Qwen2.5-Math-7B-Instruct, our SFT model lags behind it on all mathematics benchmarks.

Process Reward Models

Implicit PRM: Free Process Rewards without Process Labels

We adopt Implicit PRM, which yields process rewards at no additional cost: it only requires training an ORM on the cheaper response-level labels. During inference, implicit process rewards are obtained with a forward pass by calculating the log-likelihood ratio at each step.

The key ingredient of Implicit PRM is the reward representation, as demonstrated below:

Proposition: Consider an ORM where the reward is parameterized by the log-likelihood ratio of two causal LMs, i.e. $r_\phi(\mathbf{y}) := \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}$. Define $q_\phi^t(\mathbf{y}_{<t}, y_t) := \sum_{i=1}^{t} \beta \log \frac{\pi_\phi(y_{i}|\mathbf{y}_{<i})}{\pi_\text{ref}(y_{i}|\mathbf{y}_{<i})}$. Then $q_\phi^t$ is the exponential average of $r_\phi$ at step $t$:

$$q_\phi^t(\mathbf{y}_{<t}, y_t) = \beta \log \mathbb{E}_{\pi_\text{ref}(\mathbf{y}|\mathbf{y}_{\leq t})} e^{\frac{1}{\beta}r_\phi(\mathbf{y})}$$

Hence, $q_\phi^t$ represents an exact expectation of the outcome reward $r_\phi$ at step $t$, i.e., the Q value.
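The identity in the proposition can be checked numerically on a toy problem. The sketch below is only an illustration: the two "language models" are hypothetical hand-written next-token distributions over a two-token vocabulary (the `next_token_probs` rule and the bias values are assumptions, not anything from PRIME), and the expectation is computed by brute-force enumeration of all continuations.

```python
import itertools
import math

BETA = 0.05
VOCAB = (0, 1)
LENGTH = 3  # every complete response has exactly three tokens


def next_token_probs(prefix, bias):
    """Toy causal LM (hypothetical): next-token distribution given a prefix."""
    # The logit of token 1 grows with the number of 1s already generated.
    logit = bias * (1 + sum(prefix))
    p1 = 1.0 / (1.0 + math.exp(-logit))
    return {0: 1.0 - p1, 1: p1}


def seq_logprob(tokens, bias, start=0):
    """Sum of log-probabilities of tokens[start:], each given its prefix."""
    return sum(
        math.log(next_token_probs(tokens[:i], bias)[tokens[i]])
        for i in range(start, len(tokens))
    )


PHI_BIAS, REF_BIAS = 0.7, -0.3  # stand-ins for pi_phi and pi_ref


def log_ratio_reward(tokens):
    """beta * log(pi_phi / pi_ref) over the given tokens (prefix or full response)."""
    return BETA * (seq_logprob(tokens, PHI_BIAS) - seq_logprob(tokens, REF_BIAS))


def q_from_expectation(prefix):
    """beta * log E_{pi_ref(y | prefix)} exp(r_phi(y) / beta), by enumeration."""
    total = 0.0
    for tail in itertools.product(VOCAB, repeat=LENGTH - len(prefix)):
        full = tuple(prefix) + tail
        p_tail = math.exp(seq_logprob(full, REF_BIAS, start=len(prefix)))
        total += p_tail * math.exp(log_ratio_reward(full) / BETA)
    return BETA * math.log(total)


# Compare the running sum of log-ratios with the expectation for every prefix.
max_gap = max(
    abs(log_ratio_reward(prefix) - q_from_expectation(prefix))
    for t in range(1, LENGTH + 1)
    for prefix in itertools.product(VOCAB, repeat=t)
)
print(f"largest gap between the two sides: {max_gap:.2e}")
```

The two sides agree up to floating-point error because the ratio $\pi_\phi / \pi_\text{ref}$ over the continuation, averaged under $\pi_\text{ref}$, sums to one.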

The proposition indicates that when modeling $r_\phi(\mathbf{y}) := \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}$ to train an ORM with the standard pipeline, where $\beta$ is a hyperparameter, $\phi$ can implicitly learn a Q function. Hence, the process reward $r_\phi^t$ can be obtained by:

$$r_\phi^t := q_\phi^t - q_\phi^{t-1} = \beta \log \frac{\pi_\phi(y_{t}|\mathbf{y}_{<t})}{\pi_\text{ref}(y_{t}|\mathbf{y}_{<t})}$$

Therefore, we can indeed obtain PRMs simply by collecting response-level data and training an ORM, without any burden of annotating step labels.
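As a minimal sketch of the formula above, the helper below turns per-token log-probabilities from the PRM and the reference model into token-level process rewards and running Q values. The function name and the example log-probabilities are made up for illustration; in practice the log-probabilities would come from forward passes of the two models.

```python
def implicit_process_rewards(prm_logprobs, ref_logprobs, beta=0.05):
    """Return (r, q): per-token rewards r^t and their running sums q^t."""
    if len(prm_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob lists must have the same length")
    # r^t = beta * (log pi_phi(y_t | y_<t) - log pi_ref(y_t | y_<t))
    rewards = [beta * (lp - lr) for lp, lr in zip(prm_logprobs, ref_logprobs)]
    # q^t is the cumulative sum of r^1..r^t, so r^t = q^t - q^(t-1).
    q_values, running = [], 0.0
    for r in rewards:
        running += r
        q_values.append(running)
    return rewards, q_values


# Hypothetical per-token log-probabilities for a three-token response.
prm_lp = [-0.2, -1.0, -0.5]
ref_lp = [-0.4, -0.8, -0.9]
rewards, q_values = implicit_process_rewards(prm_lp, ref_lp)
print(rewards, q_values)
```

The last Q value equals the response-level reward $r_\phi(\mathbf{y})$, which is why training on outcome labels alone is enough to shape every intermediate step.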

The proposition is agnostic to the specific choice of ORM training objective. It can be instantiated with the same objectives as vanilla ORM training, with the only difference being that $r_\phi(\mathbf{y})$ is substituted with $\beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}$. For example, DPO already meets our assumption and serves as a strong variant, while in this work, we instantiate our implicit PRM with the cross entropy (CE) loss due to its memory efficiency:

$$\mathcal{L}_{CE} = l \cdot \log \sigma \left( \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})} \right) + (1-l) \cdot \log\left[ 1 - \sigma \left( \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})} \right) \right]$$
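A small sketch of this objective for a single response is given below. Two things are assumptions rather than statements from the text: the function name, and the sign convention. The code returns the negative of the expression above so that it can be minimized like a standard binary cross entropy.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def implicit_prm_ce_loss(prm_logprob, ref_logprob, label, beta=0.05):
    """Binary cross entropy on the implicit reward of one response.

    prm_logprob / ref_logprob: log-probability of the whole response under
    pi_phi and pi_ref. label: 1 for a correct response, 0 otherwise.
    """
    reward = beta * (prm_logprob - ref_logprob)  # r_phi(y)
    p_correct = sigmoid(reward)
    # Negated log-likelihood (assumed convention), so lower is better.
    return -(label * math.log(p_correct) + (1 - label) * math.log(1.0 - p_correct))


# When pi_phi equals pi_ref the reward is 0 and the loss is log(2) for either label.
print(implicit_prm_ce_loss(-12.0, -12.0, label=1))
# Raising the likelihood of a correct response lowers its loss.
print(implicit_prm_ce_loss(-8.0, -12.0, label=1))
```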

Reinforcement Learning

Our goal is clear and focused: to extensively leverage reinforcement learning (RL) to enhance reasoning capabilities. Aiming at the best practices of such a paradigm with limited resources, our key insights can be summarized below:

  • Start from high-quality data with ground truth verifiers: We did rigorous data collection and cleaning to obtain verifiable RL data, and found that using only an outcome verifier is already a strong baseline.
  • Simple REINFORCE-like algorithms are surprisingly effective: We compared different RL algorithms and concluded that value model-free REINFORCE-like methods are powerful enough.
  • Use “mid-difficulty” problems for stabilized training: We proposed a mechanism named online prompt filter, which largely stabilized RL training by filtering out overly difficult and overly simple questions.
  • Implicit process rewards push the boundary even further! We successfully integrated process rewards into online RL, and observed great training acceleration and performance improvement. The method is seamlessly accessible to everyone.

Pilot Study on Algorithms and Data

RL Data Collection & Preprocessing

We curated a high-quality RL training dataset of mathematics and coding problems with outcome verifiers (LaTeX answers for math and test cases for coding).

  • For math, we sourced from NuminaMath-CoT, which contains about 860K math problems. The problems span from Chinese high school mathematics to International Mathematical Olympiad competition questions.
  • For coding, we sourced from APPS, CodeContests, TACO, and Codeforces.

To further increase data quality, we conducted detailed cleaning and filtering. The preprocessing details can be found in the Appendix. In the end, we retained 457K math problems and 27K coding problems.

Online Prompt Filtering

During the rollout stage, we find that choosing appropriate prompts matters a lot, in particular preserving only the prompts within a certain difficulty range. Inspired by Qwen-2.5-Math, which filtered prompts beforehand according to the accuracy of the initial policy model, we perform online prompt filtering throughout training. We sample multiple trajectories for each prompt, then calculate the accuracy and preserve the prompts whose accuracy falls within a certain range. This also balances the training data distribution for the PRM update.

We conducted experiments validating this prompt filtering strategy. We sampled 4 trajectories for each prompt and set the range as $[0.2, 0.8]$, which means we discard prompts that are either too easy or too hard. We plot the training rewards in the figure below.

Figure: Training rewards for the online prompt filtering experiment.

From the results, we can see that the online prompt filter substantially lowers the variance of RL training.
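The filtering rule described above can be sketched in a few lines. The function name and the data layout (a dict from prompt id to per-response correctness flags) are assumptions made for the example.

```python
def filter_prompts(rollout_correct, low=0.2, high=0.8):
    """Keep prompts whose empirical accuracy lies in [low, high].

    rollout_correct maps a prompt id to a list of booleans, one per sampled
    response, as reported by the outcome verifier.
    """
    kept = {}
    for prompt_id, outcomes in rollout_correct.items():
        accuracy = sum(outcomes) / len(outcomes)
        if low <= accuracy <= high:
            kept[prompt_id] = accuracy
    return kept


# Four sampled responses per prompt, as in the experiment above.
rollouts = {
    "too_hard": [False, False, False, False],
    "mid_a": [True, False, False, False],
    "mid_b": [True, True, True, False],
    "too_easy": [True, True, True, True],
}
kept = filter_prompts(rollouts)
print(kept)
```

With 4 samples per prompt, the range $[0.2, 0.8]$ keeps exactly the prompts with 1, 2, or 3 correct responses, so every kept prompt contributes both positive and negative examples.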

RL Algorithms

We compared different online RL algorithms including PPO, REINFORCE, RLOO, GRPO, and ReMax. We implemented them with verl and conducted pilot experiments with outcome verifiers as rewards. Specifically, the ground truth outcome rewards are defined as:

$$r_o^{\text{math}}(y)= \begin{cases}1, & \text{if answer matched} \\ 0, & \text{if answer not matched} \end{cases}$$

$$r_o^{\text{coding}}(y)= \frac{\sum \text{passed test cases}}{\sum \text{test cases}}$$
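The two reward definitions translate directly into code. In this sketch the function names are hypothetical, and answer matching is reduced to string equality after stripping whitespace, which is an assumption: a real verifier would need to normalize LaTeX answers before comparing them.

```python
def math_outcome_reward(predicted_answer, reference_answer):
    """1.0 if the final answers match, else 0.0 (plain string comparison)."""
    return 1.0 if predicted_answer.strip() == reference_answer.strip() else 0.0


def coding_outcome_reward(test_results):
    """Fraction of passed test cases; test_results is a list of booleans."""
    if not test_results:
        raise ValueError("at least one test case is required")
    return sum(test_results) / len(test_results)


print(math_outcome_reward(" 42 ", "42"))
print(coding_outcome_reward([True, True, False, True]))
```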

For these preliminary experiments, we began training with a fine-tuned Llama-3.1-8B model and report the results in the Appendix. We find that REINFORCE-like algorithms, despite being simpler than PPO, are strong enough to produce stable results. We choose the best-performing one, RLOO, as our RL algorithm. Note that we only adopt the advantage/return estimation function of RLOO, and use the PPO policy loss with importance sampling and value clipping for training stability.
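For reference, the RLOO advantage estimate for a group of responses to the same prompt subtracts, from each response's reward, the mean reward of the other responses. A minimal sketch follows; the function name is an assumption.

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages for K responses sampled from one prompt."""
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two responses per prompt")
    total = sum(rewards)
    # Baseline for response i is the mean reward of the other K - 1 responses.
    return [r - (total - r) / (k - 1) for r in rewards]


# Four responses, two of which were verified as correct.
advantages = rloo_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Because the baseline for each response excludes that response, no separate value model is needed.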

PRIME: Reinforcement Learning with PRM

Integrating PRMs into (online) reinforcement learning is not trivial, and poses several critical challenges to solve. Here we present the key challenges and how we solved them with Implicit PRM.

🤔How to provide dense rewards to reinforcement learning?

Reward sparsity has been a long-standing problem in RL, including RL for LLMs. So far, there is no widely accepted way to compose dense rewards in (online) RL for LLMs. Previous approaches mainly set up an additional value model for dense rewards, which is known to be hard to train and brings little performance gain. Therefore, it is unclear how we can incorporate process rewards into RL practice.

💡We seamlessly utilize process rewards for every token in advantage/return estimation.

Under our reward modeling objective $r_\phi(\mathbf{y}) := \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}$, we can obtain token-level process rewards from implicit PRMs for free. In this way, our PRM can directly replace the value model in PPO, making it extremely easy to combine with any advantage estimation function and with outcome rewards. In practice, we integrated process rewards with REINFORCE, RLOO, GRPO, ReMax, and PPO with minor modifications.


🤔How to set up a good PRM to start RL?

Even if we find a path to use process rewards in RL, training good PRMs to start with is also non-trivial. Practitioners need to collect large-scale (process) reward data which is expensive and the model should achieve a good balance between generalization and distribution shift.

💡Start with your policy model as PRM.

Implicit PRM is inherently a language model. So theoretically, you can use any language model as the PRM. In practice, we find that the starting policy model itself serves as a great (if not the best) initialization of the PRM. That means you only need one model to start your RL journey! This makes RL with implicit PRMs more accessible than ever before.


🤔How to update PRM online to prevent reward hacking?

In online RL, it is crucial that your RM is not overoptimized or hacked, which requires the RM to keep updating along with the policy model. However, because step labels are expensive, it is difficult to update PRMs during RL training. This raises considerable scalability and generalization concerns for PRMs in RL.

💡Implicit PRMs only demand outcome labels to update.

That is to say, with outcome verifiers, we can easily update our PRMs during training! In experiments, we illustrate the importance of online PRM. Moreover, we can also do double-forward, where we first update the PRM with on-policy rollouts, then re-calculate the process rewards with the updated PRM, and thus provide an even more accurate reward estimation.

PRIME Algorithm

We describe our final algorithm in this section. First, we illustrate the full cycle of PRIME with animation.

Figure: Animation of the full PRIME cycle.

The policy model and PRM are both initialized with the SFT model. In each RL iteration, the policy model first generates rollouts. Then, the implicit PRM and the outcome verifier score the rollouts, and the implicit PRM gets updated on the rollouts with the outcome reward. Finally, the outcome reward $r_o$ and process reward $r_p$ are combined and used to update the policy model.

Implementation

We present pseudo code here:

Figure: Pseudocode of the PRIME algorithm.

The algorithm flow includes:

  1. Prompt filtering based on policy model performance, only preserving those prompts on which the policy model $\pi_\theta$ achieves an accuracy between 0.2 and 0.8.

  2. Calculate the implicit process reward $r^t$.

  3. Update the implicit PRM $\pi_\phi$ based on the predicted implicit process reward $r^t$ and the ground truth outcome label $r$.

  4. Advantage estimation with RLOO. Specifically, we first calculate the return of outcome rewards and implicit process rewards separately:

    • For ground truth outcome rewards, we directly adopt RLOO without any modification.

    • For implicit process rewards, we perform a three-step process to calculate the return: (1) Use the averaged implicit process rewards to calculate the leave-one-out baseline; (2) Normalize the process reward at step $t$ by subtracting the baseline; (3) Calculate the discounted return for each response.

      Finally, advantage is set to the combination of both returns.

  5. Update the policy $\pi_\theta$ using the PPO loss for proper importance sampling.
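Step 4 can be sketched as below for the responses sampled from a single prompt. This is one possible reading of the description, not the released implementation. The sketch makes several assumptions: the function names, a discount factor of 1.0, broadcasting the outcome advantage to every token, and combining the two returns by summing them with equal weight.

```python
def leave_one_out_baselines(values):
    """For each entry, the mean of all the other entries."""
    k = len(values)
    total = sum(values)
    return [(total - v) / (k - 1) for v in values]


def discounted_returns(rewards, gamma=1.0):
    """Return-to-go at every position of one response."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


def prime_advantages(outcome_rewards, process_rewards, gamma=1.0):
    """Token-level advantages for K responses to one prompt.

    outcome_rewards: K floats from the outcome verifier.
    process_rewards: K lists of per-token implicit process rewards.
    """
    outcome_base = leave_one_out_baselines(outcome_rewards)
    # (1) leave-one-out baseline from each response's averaged process reward
    mean_process = [sum(r) / len(r) for r in process_rewards]
    process_base = leave_one_out_baselines(mean_process)
    advantages = []
    for i, token_rewards in enumerate(process_rewards):
        outcome_adv = outcome_rewards[i] - outcome_base[i]  # plain RLOO
        # (2) subtract the baseline from every token-level reward
        centered = [r - process_base[i] for r in token_rewards]
        # (3) discounted return of the centered process rewards
        process_ret = discounted_returns(centered, gamma)
        # Assumed combination: sum of both terms at every token.
        advantages.append([outcome_adv + g for g in process_ret])
    return advantages


adv = prime_advantages(
    outcome_rewards=[1.0, 0.0],
    process_rewards=[[0.02, 0.01], [-0.01, -0.02, 0.0]],
)
print(adv)
```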

Experiments

Settings

By default, we initialize the implicit PRM with the SFT model and retain the SFT model for reference logprobs. For hyperparameters, we use a constant learning rate of 5e-7 with the AdamW optimizer for the policy model, and a learning rate of 1e-6 for the PRM. Both the policy and the PRM use a mini-batch size of 256 and a micro-batch size of 8. The rollout stage collects 256 prompts and samples 4 responses for each prompt. We set $\beta=0.05$ for PRM training. We set the KL coefficient to 0 in all experiments.

Main Results

We first present the effect of dense rewards in reinforcement learning. Here we compare PRIME with RLOO w/ outcome verifier (OV) only, which means there are only ground truth outcome rewards for each trajectory. We trained this baseline for 240 steps. For PRIME, we use the same setting and trained the model for 592 steps. We plot the training rewards measured by the outcome verifier and the test accuracy in the following figures. Compared with sparse rewards, PRIME accelerates RL training by 2.5× and improves the final rewards by 6.9%, with lower variance. On downstream tasks, PRIME also consistently outperforms the OV-only setup.

Figure: Training outcome rewards. For fair comparison, we cut the training steps at 240.

Figure: Test accuracy comparison.

We list detailed results below. We can see that at the same step (240), the model trained with PRIME is generally better than the model trained with outcome rewards only, leading to a 4-point gap in average performance. PRIME further improves the model with more training steps.

| Method | Step | AIME 2024 | AMC | MATH-500 | Minerva Math | OlympiadBench | LeetCode | LiveCodeBench | Math Avg. | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Eurus-2-7B-SFT | 0 | 3.3 | 30.1 | 66.2 | 32.7 | 29.8 | 21.7 | 17.8 | 32.2 | 28.8 |
| RLOO w/ OV Only | 240 | 20.0 | 47.0 | 73.2 | 36.4 | 35.4 | 28.3 | 26.7 | 42.2 | 36.9 |
| PRIME | 80 | 20.0 | 41.0 | 68.2 | 38.2 | 37.0 | 26.7 | 26.6 | 40.9 | 36.8 |
| PRIME | 160 | 13.3 | 42.2 | 72.0 | 37.1 | 38.7 | 26.7 | 25.6 | 40.7 | 36.5 |
| PRIME | 240 | 20.0 | 50.6 | 78.2 | 39.3 | 40.3 | 31.1 | 27.5 | 45.7 | 41.0 |
| PRIME | 320 | 16.7 | 51.8 | 77.8 | 39.7 | 41.5 | 36.1 | 28.5 | 45.5 | 41.7 |
| PRIME | 592 | 26.7 | 57.8 | 79.2 | 38.6 | 42.1 | 33.3 | 28.6 | 48.9 | 43.9 |

Effect of Online PRM

We introduced the online PRM, which is updated with policy model rollouts and their corresponding verifier outcomes. Here we demonstrate the importance of online updates for PRMs. We compare two settings, where the online PRM is initialized from Eurus-2-7B-SFT and the offline PRM is EurusPRM-Stage1. From the figures below, we can see that the online PRM outperforms the offline PRM by a large margin on both the training and test sets.

Figures: Online PRM vs. offline PRM on the training and test sets.

Effect of Reference Policy

We implement two variants of our algorithm to explore the effect of the reference policy: one uses the initial SFT model as the reference model, while the other uses the running policy’s old logprobs as the reference, as shown in the figures below. The first (policy ref) simply adopts the old logprob of the policy model as $\pi_{\text{ref}}$, while the second (SFT ref) retains the initial SFT model for an additional $\pi_{\text{ref}}$ calculation. We compare their performance in this section.

Figure: Policy ref. We discard the reference policy and use the old logprob as $\pi_{\text{ref}}$ for the PRM.

Figure: SFT ref. We retain the initial policy to provide $\pi_{\text{ref}}$ for the PRM and KL.

Figure: Training rewards of the SFT ref and policy ref variants.

| Step | SFT Ref | Policy Ref |
|---|---|---|
| 80 | 36.8 | 36.7 |
| 160 | 36.5 | 38.4 |
| 240 | 41.0 | 40.5 |
| 320 | 41.7 | 41.0 |

From the training rewards and test accuracy, we find the two strategies are close, and they have pros and cons in different aspects: Policy ref only needs two models in RL training, while SFT ref requires one more reference model. On the other hand, KL divergence calculation is only allowed when the initial SFT model is retained.

Single-Forward vs. Double-Forward

Since our implicit PRM is concurrently updated in training, for each rollout stage, we can update PRM before policy model and use the updated PRM to re-calculate the process rewards, which we call the double-forward setting. We investigate the impact of double-forward in both training and test phase. Our default setting applies single-forward, which uses process rewards from old PRMs. We plot PRM accuracy on rollouts and training rewards below.

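Figure: PRM accuracy on rollouts, single-forward vs. double-forward.

Figure: Training rewards, single-forward vs. double-forward.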

Accordingly, we find that double-forward could increase PRM accuracy, but the training rewards remain close between the two methods.

We also compare the average test set accuracy of single-forward and double-forward. Their performance is also close. Since double-forward brings more computation overhead, we recommend the single-forward setting in practice.

| Step | Single-Forward | Double-Forward |
|---|---|---|
| 80 | 36.8 | 35.7 |
| 160 | 36.5 | 37.4 |
| 240 | 41.0 | 40.4 |
| 320 | 41.7 | 41.0 |
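The difference between the two settings is only the order of scoring and updating within a rollout stage. The toy sketch below makes that order explicit. `ToyPRM` and `prm_stage` are hypothetical stand-ins: a version counter replaces the model weights, and each score is tagged with the PRM version that produced it.

```python
class ToyPRM:
    """Stand-in for an implicit PRM; `version` plays the role of its weights."""

    def __init__(self):
        self.version = 0

    def score(self, rollouts):
        # Tag each rollout with the PRM version that scored it.
        return [(rollout, self.version) for rollout in rollouts]

    def update(self, rollouts, outcome_labels):
        # A real PRM would take a gradient step on the outcome labels here.
        self.version += 1


def prm_stage(prm, rollouts, outcome_labels, double_forward=False):
    """One rollout stage; returns the process rewards given to the policy update."""
    rewards = prm.score(rollouts)         # forward pass with the old PRM
    prm.update(rollouts, outcome_labels)  # online update from outcome labels
    if double_forward:
        rewards = prm.score(rollouts)     # second forward pass, updated PRM
    return rewards


single = prm_stage(ToyPRM(), ["a", "b"], [1, 0])
double = prm_stage(ToyPRM(), ["a", "b"], [1, 0], double_forward=True)
print(single, double)
```

The extra forward pass in the double-forward branch is the additional computation mentioned above.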

Inference Scaling with Implicit PRM

Beyond RL, the implicit PRM can also scale inference-time computation through Best-of-N sampling. In this section, we present EurusPRM, a SOTA-level open-source PRM for Best-of-N sampling.

PRM Training

We introduce a two-stage training pipeline on top of Qwen2.5-Math-7B-Instruct for EurusPRM. We collected instructions with ground truth and employed Qwen2.5-Math-7B-Base, Llama-3.1-8B-Base/Instruct, Llama-3.1-70B-Instruct, Qwen2.5-72B-Instruct, and our SFT model to sample rollouts. Training dataset statistics can be found in the Appendix.

Stage 1: Training on Complete Response-level Rollouts

We applied the above $\mathcal{L}_{CE}$ to train the implicit PRM. We used a learning rate of 5e-7 and a batch size of 64 for training.

Stage 2: Training on Manufactured Partial Step-level Pairs

We started the second-stage training on top of the first-stage model with fine-grained step-level labels. To obtain step-level labels, we employed Llama-3.1-70B-Inst and Qwen2.5-72B-Inst to insert subtle errors into correct solutions. We also mixed response-level data into this stage. The model was further trained with $\mathcal{L}_{CE}$ with a learning rate of 5e-7 and a batch size of 64.

PRM Evaluation

Evaluation Base Model

We adopt Eurus-2-7B-SFT, Qwen2.5-7B-Instruct and Llama-3.1-70B-Instruct as generation models to evaluate the performance of our implicit PRM. For all models, we set the sampling temperature to 0.5 and the top-p value to 1.

Best-of-N Sampling

We use Best-of-64 as our evaluation metric. The weighting methods are different for several PRMs below.

  • For Skywork-o1-Open-PRM-Qwen-2.5-7B, we use simple average reward across all steps.
  • For EurusPRM-Stage 1, we use the minimum reward across all steps.
  • For EurusPRM-Stage 2, we use the accumulative rewards.
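The three aggregation rules above differ only in how step rewards are reduced to one score per response. A minimal sketch follows, with made-up step rewards and hypothetical function names. The weighted variant is not defined in this post. The code assumes a common definition: sum the aggregated scores of all responses that share a final answer, and return the answer with the highest total.

```python
def aggregate(step_rewards, method):
    """Reduce a response's step rewards to a single score."""
    if method == "mean":  # simple average across steps
        return sum(step_rewards) / len(step_rewards)
    if method == "min":   # minimum reward across steps
        return min(step_rewards)
    if method == "sum":   # accumulative reward
        return sum(step_rewards)
    raise ValueError(f"unknown method: {method}")


def best_of_n(candidates, method):
    """candidates: list of (final_answer, step_rewards). Returns the top answer."""
    return max(candidates, key=lambda c: aggregate(c[1], method))[0]


def weighted_best_of_n(candidates, method):
    """Assumed definition: total score per distinct answer, highest total wins."""
    totals = {}
    for answer, step_rewards in candidates:
        totals[answer] = totals.get(answer, 0.0) + aggregate(step_rewards, method)
    return max(totals, key=totals.get)


candidates = [
    ("12", [0.9, 0.1, 0.9]),
    ("15", [0.6, 0.5]),
    ("15", [0.5, 0.5]),
]
for method in ("mean", "min", "sum"):
    print(method, best_of_n(candidates, method), weighted_best_of_n(candidates, method))
```

In this toy example the minimum penalizes the response with one weak step, while the mean and the sum still prefer it.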

Eurus-2-7B-SFT

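| Method | Reward Model | MATH | AMC | AIME 2024 | OlympiadBench | Minerva Math | Avg. |
|---|---|---|---|---|---|---|---|
| Greedy Pass @ 1 | N/A | 65.1 | 30.1 | 3.3 | 29.8 | 32.7 | 32.2 |
| Majority Voting @ 64 | N/A | 65.6 | 53.0 | 13.3 | 39.1 | 22.4 | 38.7 |
| Best-of-64 | Skywork-o1-Open-PRM-Qwen-2.5-7B | 47.2 | 45.8 | 10.0 | 32.3 | 16.2 | 30.3 |
| Best-of-64 | EurusPRM-Stage 1 | 44.6 | 41.0 | 6.7 | 32.9 | 17.3 | 28.5 |
| Best-of-64 | EurusPRM-Stage 2 | 47.2 | 43.4 | 13.3 | 33.8 | 19.2 | 31.4 |
| Weighted Best-of-64 | Skywork-o1-Open-PRM-Qwen-2.5-7B | 64.6 | 55.4 | 13.3 | 41.3 | 23.2 | 39.6 |
| Weighted Best-of-64 | EurusPRM-Stage 1 | 66.0 | 54.2 | 13.3 | 39.6 | 29.0 | 40.4 |
| Weighted Best-of-64 | EurusPRM-Stage 2 | 66.0 | 54.2 | 13.3 | 39.7 | 29.0 | 40.4 |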

Llama-3.1-70B-Instruct

| Method | Reward Model | MATH | AMC | AIME 2024 | OlympiadBench | Minerva Math | Avg. |
|---|---|---|---|---|---|---|---|
| Greedy Pass @ 1 | N/A | 64.6 | 30.1 | 16.7 | 31.9 | 35.3 | 35.7 |
| Majority Voting @ 64 | N/A | 80.2 | 53.0 | 26.7 | 40.4 | 38.6 | 47.8 |
| Best-of-64 | Skywork-o1-Open-PRM-Qwen-2.5-7B | 77.8 | 56.6 | 23.3 | 39.0 | 31.6 | 45.7 |
| Best-of-64 | EurusPRM-Stage 1 | 77.8 | 44.6 | 26.7 | 35.3 | 41.5 | 45.2 |
| Best-of-64 | EurusPRM-Stage 2 | 80.6 | 59.0 | 20.0 | 37.6 | 44.9 | 48.4 |
| Weighted Best-of-64 | Skywork-o1-Open-PRM-Qwen-2.5-7B | 81.2 | 56.6 | 23.3 | 42.4 | 38.2 | 48.3 |
| Weighted Best-of-64 | EurusPRM-Stage 1 | 80.4 | 53.0 | 26.7 | 40.9 | 46.7 | 49.5 |
| Weighted Best-of-64 | EurusPRM-Stage 2 | 80.4 | 53.0 | 26.7 | 41.0 | 46.3 | 49.5 |

Qwen2.5-7B-Instruct

| Method | Reward Model | MATH | AMC | AIME 2024 | OlympiadBench | Minerva Math | Avg. |
|---|---|---|---|---|---|---|---|
| Greedy Pass @ 1 | N/A | 73.3 | 47.0 | 13.3 | 39.4 | 35.3 | 41.7 |
| Majority Voting @ 64 | N/A | 82.0 | 53.0 | 16.7 | 43.0 | 36.4 | 46.2 |
| Best-of-64 | Skywork-o1-Open-PRM-Qwen-2.5-7B | 85.2 | 60.2 | 20.0 | 44.7 | 32.7 | 48.6 |
| Best-of-64 | EurusPRM-Stage 1 | 81.8 | 47.0 | 16.7 | 40.1 | 41.5 | 45.4 |
| Best-of-64 | EurusPRM-Stage 2 | 86.0 | 59.0 | 16.7 | 41.4 | 41.5 | 48.9 |
| Weighted Best-of-64 | Skywork-o1-Open-PRM-Qwen-2.5-7B | 83.6 | 55.4 | 13.3 | 43.7 | 36.8 | 46.6 |
| Weighted Best-of-64 | EurusPRM-Stage 1 | 82.6 | 53.0 | 16.7 | 42.7 | 45.2 | 48.0 |
| Weighted Best-of-64 | EurusPRM-Stage 2 | 84.8 | 53.0 | 16.7 | 43.2 | 45.6 | 48.7 |

Appendix

SFT Data and Training Details

The SFT data statistics are as follows:

| Task | Dataset | Size | Avg. Response Length | Source |
|---|---|---|---|---|
| Math | MathInstruct-MATH | 12715 | 964.01 | https://huggingface.co/datasets/TIGER-Lab/MathInstruct |
| Math | OpenMathInstruct-2-Augmented_Math | 15086 | 1202.25 | https://huggingface.co/datasets/nvidia/OpenMathInstruct-2 |
| Math | Numina | 55845 | 1331.61 | https://huggingface.co/datasets/AI-MO/NuminaMath-CoT |
| Math | reasoning-001 | 29831 | 1316.49 | https://huggingface.co/datasets/SkunkworksAI/reasoning-0.01 |
| Coding | Code-Feedback | 27663 | 1805.16 | https://huggingface.co/datasets/m-a-p/Code-Feedback |
| Coding | Magicoder | 24480 | 1828.72 | https://huggingface.co/datasets/ise-uiuc/Magicoder-Evol-Instruct-110K |
| Coding | Magicoder-OSS | 28980 | 1850.05 | https://huggingface.co/datasets/ise-uiuc/Magicoder-OSS-Instruct-75K |
| Biomedicine | UltraMedical_mc | 35163 | 891.06 | https://huggingface.co/datasets/TsinghuaC3I/UltraMedical |
| Total / Avg. | - | 229763 | 1390.75 | - |

Training Details

The following hyperparameters were used during training:

| Parameter | Value |
|---|---|
| Fine-tuning Type | Full |
| Data Max Length | 6144 |
| Learning Rate | 1e-05 |
| GPU Batch Size | 2 |
| Seed | 42 |
| Gradient Accumulation | 2 |
| Train Batch Size | 96 |
| Optimizer | OptimizerNames.ADAMW_TORCH with betas=(0.9,0.999) and epsilon=1e-08 |
| LR Schedule | Cosine |
| Warmup Ratio | 0.1 |
| Epochs | 3 |

RL Data Preprocessing

Data Filtering and Question-Type Classification

The preprocessing pipeline employs a systematic rule-based approach to filter and classify mathematical problems to create a high-quality dataset with solvable problems, appropriate difficulty levels, and correct solutions.

We exclude problems containing figures or diagrams since they require visual processing capabilities. We also remove proof questions because their answers are difficult to verify. The remaining problems are classified into question-answering, multiple-choice, or fill-in-the-blank questions based on specific patterns. Since fill-in-the-blank questions comprise fewer than 400 examples compared to the much larger set of multiple-choice questions, we focus solely on multiple-choice questions for further processing.

Converting to Direct Question-Answer Format

We transform multiple-choice questions into a direct question-answer format through three sequential stages: rule-based filtering, LLM-based filtering, and LLM-based formatting.

We first identify and remove questions that inherently require multiple-choice options - specifically, those where comparing specific statements or properties is essential to the problem-solving process. These questions cannot be meaningfully converted to a direct question-answer format. The initial filtering employs simple rule-based pattern matching, searching for keywords like "following" and "statement" that typically indicate option-dependent problems.
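A rule of this kind can be sketched with a single regular expression. Only the two keywords named above are used. The exact pattern list of the pipeline is not given here, so the pattern and the function name are assumptions.

```python
import re

# Keywords that typically signal an option-dependent question (from the text above).
OPTION_DEPENDENT = re.compile(r"\b(following|statements?)\b", re.IGNORECASE)


def requires_options(question):
    """True if the question text matches an option-dependent keyword."""
    return OPTION_DEPENDENT.search(question) is not None


print(requires_options("Which of the following statements is true?"))
print(requires_options("What is the value of 2 + 3?"))
```

Questions flagged by this rule are dropped before the LLM-based stages, so the more expensive classification only runs on the remainder.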

Following the rule-based filtering, we employ Llama-3.1-8B-Instruct to perform a more nuanced classification of the remaining questions. Our pilot study revealed that while the LLM occasionally misclassifies questions, it tends to err on the conservative side - marking potentially convertible questions as requiring options rather than the reverse. Given our large dataset, we accepted this conservative approach to maintain quality.

For questions classified as convertible, we implement a two-phase reformatting process:

  1. Question Reformatting: Removing choice indicators and restructuring the question to elicit direct answers
  2. Solution Reformatting: Converting multiple-choice solutions into step-by-step derivations, ensuring all final answers are presented in standard LaTeX boxed format

This systematic approach maintains mathematical rigor while creating a standardized format suitable for downstream applications.

Problem and Solution Validation

The final stage involves merging all question-answer pairs and performing LLM-based comprehensive validation. We identify two key aspects in validation: solvability and correctness.

We leverage state-of-the-art mathematical reasoning models, including QwQ-32B-Preview and Qwen2.5-Math-72B-Instruct, employing a self-consistency approach to determine problem solvability, and if solvable, verify the correctness of solutions provided in the original dataset.

To enhance validation accuracy, we first analyzed sample problems to identify characteristics of solvable and unsolvable cases and created synthetic unsolvable problems featuring missing conditions or logical contradictions. Based on these samples, we developed specialized prompts to improve the models' ability to distinguish solvability.

Each problem undergoes five independent validation attempts, where the LLM:

  1. Provides step-by-step solutions using LaTeX formatting
  2. Identifies insolvability due to missing conditions or logical contradictions
  3. Generates complete reasoning traces for solvable problems
  4. Presents final answers in standardized LaTeX boxed format (\boxed{})
  5. Documents any impediments to solution completion

We evaluate two key consistency measures across multiple validation attempts:

  • Status Consistency: Agreement on problem solvability
  • Answer Consistency:
    • Consistency of solutions across different attempts
    • Agreement between generated solutions and ground truth

The final dataset retains only problems that demonstrate:

  • Consistent solvability across validation attempts
  • Agreement in solutions across multiple attempts
  • Alignment with ground truth answers

This rigorous validation process ensures the resulting dataset comprises well-defined, solvable problems with verified, accurate solutions.
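The retention rule can be sketched as follows. The text does not state how much agreement is required across the five attempts, so this sketch assumes the strictest reading: every attempt must mark the problem solvable and all attempts must give the same answer. The function name and data layout are also assumptions.

```python
def keep_problem(attempts, ground_truth):
    """Decide whether a problem survives validation.

    attempts: list of (is_solvable, answer) pairs, one per validation attempt.
    """
    if not attempts:
        return False
    # Status consistency: every attempt agrees the problem is solvable.
    if not all(is_solvable for is_solvable, _ in attempts):
        return False
    # Answer consistency: all attempts produce the same final answer...
    answers = {answer for _, answer in attempts}
    if len(answers) != 1:
        return False
    # ...and that answer agrees with the ground truth.
    return answers.pop() == ground_truth


consistent = [(True, "7")] * 5
print(keep_problem(consistent, "7"))
print(keep_problem(consistent[:4] + [(True, "8")], "7"))
```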

PRM Data

Stage 1

The dataset statistics of Stage 1 Training are listed below:

| Dataset | Generator Model | Num. Inst | Resp/Inst | Step-level/Response-level |
|---|---|---|---|---|
| UltraInteract | Llama-3.1-8B-Inst | 20177 | 8 | Response-level |
| UltraInteract | Llama-3.1-8B-Base | 13570 | 8 | Response-level |
| UltraInteract | Qwen2.5-72B-Inst | 4758 | 8 | Response-level |
| UltraInteract | Qwen2.5-Math-7B-Base | 25713 | 8 | Response-level |
| Numina-SynMath | Llama-3.1-8B-Inst | 4783 | 8 | Response-level |
| Numina-SynMath | Qwen2.5-Math-7B-Base | 5806 | 8 | Response-level |
| Numina-Olympiads | Llama-3.1-8B-Inst | 2909 | 8 | Response-level |
| Numina-Olympiads | Qwen2.5-Math-7B-Base | 4739 | 8 | Response-level |

Stage 2

The dataset statistics of Stage 2 Training are listed below:

| Dataset | Generator Model | Num. Inst | Resp/Inst | Step-level/Response-level |
|---|---|---|---|---|
| MATH | Llama-3.1-70B-Inst | 4715 | 2 | Step-level |
| MATH | Qwen2.5-72B-Inst | 6098 | 2 | Step-level |
| UltraInteract | Llama-3.1-70B-Inst | 4238 | 2 | Response-level |

Other Results

Results of Different RL Algorithms

The results of different RL algorithms on Llama-3.1-8B are listed below. Since we used a different base model and dataset for the pilot study, the benchmarks used here are slightly different from the main experiments.

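| Step | Algorithm | Minerva Math | OlympiadBench | HumanEval | LeetCode | LiveCodeBench | Avg. |
|---|---|---|---|---|---|---|---|
| 256 | PPO | 21.7 | 18.2 | 62.8 | 13.3 | 17.1 | 26.6 |
| 256 | REINFORCE | 21.7 | 19.0 | 64.6 | 13.9 | 17.1 | 27.3 |
| 256 | GRPO | 22.8 | 18.4 | 59.2 | 16.1 | 17.3 | 26.8 |
| 256 | ReMax | 22.8 | 19.6 | 58.5 | 12.8 | 15.8 | 25.9 |
| 256 | RLOO | 18.8 | 20.7 | 60.4 | 16.1 | 17.8 | 26.8 |
| 1024 | REINFORCE | 19.5 | 16.0 | 57.3 | 21.1 | 16.0 | 26.0 |
| 1024 | GRPO | 22.4 | 20.3 | 57.3 | 13.3 | 18.7 | 26.4 |
| 1024 | ReMax | 24.6 | 17.3 | 61.0 | 21.1 | 18.6 | 28.5 |
| 1024 | RLOO | 21.0 | 20.6 | 57.9 | 27.8 | 21.4 | 29.7 |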

Citation

If you find PRIME or ImplicitPRM helpful, please cite them.

@misc{cui2024process,
  title={Process Reinforcement through Implicit Rewards},
  author={Ganqu Cui and Lifan Yuan and Zefan Wang and Hanbin Wang and Wendi Li and Bingxiang He and Yuchen Fan and Tianyu Yu and Qixin Xu and Weize Chen and Jiarui Yuan and Huayu Chen and Kaiyan Zhang and Xingtai Lv and Shuo Wang and Yuan Yao and Hao Peng and Yu Cheng and Zhiyuan Liu and Maosong Sun and Bowen Zhou and Ning Ding},
  year={2025}
}
@article{yuan2024implicitprm,
  title={Free Process Rewards without Process Labels},
  author={Lifan Yuan and Wendi Li and Huayu Chen and Ganqu Cui and Ning Ding and Kaiyan Zhang and Bowen Zhou and Zhiyuan Liu and Hao Peng},
  journal={arXiv preprint arXiv:2412.01981},
  year={2024}
}