rStar

Creator
Seonglae Cho
Created
2025 Jan 15 15:3
Edited
2025 Jan 15 15:12

Self-evolved math reasoning without distillation from a stronger model

Methods for scaling test-time computation have shown limited gains in math reasoning, often because of limitations in the policy LLM or the reward model. rStar-Math addresses this by iteratively evolving both the policy LLM and the reward model. It uses a code-augmented CoT data synthesis method that performs extensive MCTS rollouts to generate step-by-step verified reasoning trajectories with self-annotated MCTS Q-values. Q-values are assigned only to steps whose Python code runs successfully.
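
As a rough illustration of the idea above, the sketch below keeps only candidate steps whose Python code executes and then backs a rollout reward up into their Q-values. It is a simplified toy, not the rStar-Math implementation: the names `Step`, `executes`, `filter_candidates`, and `backup`, the additive update rule, and the reward of `1.0` are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One reasoning step: a short explanation plus the code that backs it (hypothetical structure)."""
    text: str
    code: str
    q: float = 0.0   # running Q-value, changed only by backup()
    visits: int = 0  # number of rollouts that passed through this step


def executes(code: str, env: dict) -> bool:
    """Return True if the code runs without raising; env carries variables between steps."""
    try:
        exec(code, env)
        return True
    except Exception:
        return False


def filter_candidates(candidates: list[Step], env: dict) -> list[Step]:
    """Keep only the candidate steps whose code runs successfully in a copy of env."""
    kept = []
    for step in candidates:
        trial_env = dict(env)  # copy, so a failing candidate cannot pollute shared state
        if executes(step.code, trial_env):
            kept.append(step)
    return kept


def backup(trajectory: list[Step], reward: float) -> None:
    """Add the terminal reward of one rollout to every step on its trajectory (assumed update rule)."""
    for step in trajectory:
        step.q += reward
        step.visits += 1


candidates = [
    Step("add the two numbers", "total = 2 + 3"),
    Step("syntax error", "total = 2 +"),
    Step("undefined variable", "total = x + 1"),
]
kept = filter_candidates(candidates, {})
backup(kept, reward=1.0)  # pretend the rollout reached a correct final answer
```

Here only the first candidate survives the execution check, so it is the only step that ever receives a Q-value; the two failing steps stay at their initial `q = 0.0` and would be discarded from the synthesized data.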
