GRPO
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Mathematical reasoning poses a significant challenge for language models due to its complex and structured nature. In this paper, we introduce DeepSeekMath 7B, which continues pre-training...
https://arxiv.org/abs/2402.03300
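This is the paper that introduced GRPO (Group Relative Policy Optimization): instead of a learned critic, it samples a group of outputs per prompt, scores each, and normalizes rewards within the group to get advantages. A minimal sketch of that advantage computation (function name and toy rewards are illustrative, not from the paper's code):

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantages as in GRPO (DeepSeekMath, arXiv:2402.03300).

    rewards: shape (G,), scalar rewards for G sampled outputs of one prompt.
    Each output's advantage is its reward normalized by the group mean and
    standard deviation, replacing the learned value/critic model of PPO.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Toy example: 4 sampled solutions, binary final-answer correctness rewards.
print(grpo_advantages(np.array([1.0, 0.0, 0.0, 1.0])))  # ≈ [ 1. -1. -1.  1.]
```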

Sets self-verifiable reasoning as an explicit goal: instead of rewarding only final-answer accuracy, it verifies the consistency and completeness of the reasoning process itself, directly addressing a limitation of existing RLHF / final-answer-reward approaches.
- First, train a verifier for theorem proving
- Then train the proof generator with RL, using the verifier's scores as rewards
- As the generator improves, scale verification compute to automatically label harder proofs → retrain the verifier
Generator–verifier arms-race structure (sketched below). Results with scaled test-time compute: gold level on IMO 2025 and CMO 2024, 118/120 (near-perfect) on Putnam 2024.
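A schematic of the three steps above; every interface here (`sample_proofs`, `verify`, `rl_update`, `label_with_scaled_compute`, ...) is a hypothetical placeholder, since the paper does not publish its training code:

```python
from typing import Callable, List, Tuple

Proof = str    # a natural-language proof
Score = float  # verifier judgment, e.g. in [0, 1]

def generator_verifier_round(
    sample_proofs: Callable[[str, int], List[Proof]],            # generator: theorem -> n proofs
    verify: Callable[[str, Proof], Score],                       # verifier used as reward model
    rl_update: Callable[[str, List[Proof], List[Score]], None],  # e.g. a GRPO step
    is_hard_to_verify: Callable[[Proof], bool],                  # low-confidence verifier judgment
    label_with_scaled_compute: Callable[[Proof], Score],         # many verifier passes, aggregated
    retrain_verifier: Callable[[List[Tuple[Proof, Score]]], None],
    theorems: List[str],
    n_samples: int = 8,                                          # assumed group size
) -> None:
    """One round of the generator-verifier loop (schematic)."""
    hard_pool: List[Proof] = []
    for theorem in theorems:
        # Steps 1-2: generator samples proofs; verifier scores are the RL rewards.
        proofs = sample_proofs(theorem, n_samples)
        scores = [verify(theorem, p) for p in proofs]
        rl_update(theorem, proofs, scores)
        hard_pool += [p for p in proofs if is_hard_to_verify(p)]
    # Step 3: spend extra verification compute to auto-label hard proofs,
    # then retrain the verifier so it keeps pace with the stronger generator.
    retrain_verifier([(p, label_with_scaled_compute(p)) for p in hard_pool])
```

The point of step 3 is to preserve the generation–verification gap: the verifier has to stay ahead of the generator it is training.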
deepseek-ai/DeepSeek-Math-V2 · Hugging Face
https://huggingface.co/deepseek-ai/DeepSeek-Math-V2
DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning
Large language models have made significant progress in mathematical reasoning, which serves as an important testbed for AI and could impact scientific research if further advanced.
By scaling reasoning with reinforcement learning that rewards correct final answers, LLMs have improved from poor performance to saturating quantitative reasoning competitions like AIME and HMMT in one year.
However, this approach faces fundamental limitations.
Pursuing higher final answer accuracy doesn’t address a key issue: correct answers don’t guarantee correct reasoning.
Moreover, many mathematical tasks like theorem proving require rigorous step-by-step derivation rather than numerical answers, making final answer rewards inapplicable.
To push the limits of deep reasoning, we believe it is necessary to verify the comprehensiveness and rigor of mathematical reasoning.
Self-verification is particularly important for scaling test-time compute, especially for open problems without known solutions.
Towards self-verifiable mathematical reasoning, we investigate how to train an accurate and faithful LLM-based verifier for theorem proving.
We then train a proof generator using the verifier as the reward model, and incentivize the generator to identify and resolve as many issues as possible in its own proofs before finalizing them.
To maintain the generation-verification gap as the generator becomes stronger, we propose to scale verification compute to automatically label new hard-to-verify proofs, creating training data to further improve the verifier.
Our resulting model, DeepSeekMath-V2, demonstrates strong theorem-proving capabilities, achieving gold-level scores on IMO 2025 and CMO 2024 and a near-perfect 118/120 on Putnam 2024 with scaled test-time compute.
While much work remains, these results suggest that self-verifiable mathematical reasoning is a feasible research direction that may help develop more capable mathematical AI systems.
https://arxiv.org/html/2511.22570v1
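One concrete reading of "scale verification compute to automatically label new hard-to-verify proofs" from the abstract: run many independent verifier passes on a proof and only accept a label when they nearly agree. The aggregation rule, vote budget, and thresholds below are assumptions, not from the paper:

```python
from typing import Callable, Optional

def auto_label(
    verify_once: Callable[[str], float],  # one stochastic verifier pass -> score in [0, 1]
    proof: str,
    n_passes: int = 32,                   # assumed compute budget
    agree: float = 0.9,                   # assumed agreement threshold
) -> Optional[bool]:
    """Auto-label a hard-to-verify proof by aggregating many verifier passes.

    Returns True/False when the passes are near-unanimous, or None
    (leave unlabeled) when the verifier remains uncertain.
    """
    votes = [verify_once(proof) >= 0.5 for _ in range(n_passes)]
    share = sum(votes) / n_passes
    if share >= agree:
        return True
    if share <= 1.0 - agree:
        return False
    return None
```

The same aggregation presumably transfers to test time: sample several candidate proofs, score each with repeated verifier passes, and keep the best-scoring one, which would be one way to realize the "scaled test-time compute" behind the Putnam 2024 result.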

Seonglae Cho