The paper proposes a two-step process that adds a reasoning-abstraction generation stage, with the aim of incentivizing breadth rather than depth. The Solution Generator's reward is unchanged and still uses Verifiable Reward, while the Abstraction Generator's reward is the success rate of the Solution Generator when it conditions on that abstraction. In effect, the method has an inner loop and resembles splitting GRPO into two steps, since both steps fundamentally use the same context.
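To make my reading of the two reward signals concrete, here is a minimal sketch. It is not the authors' implementation: the function names, the exact-match verifier, and the deterministic `toy_solver` are all assumptions introduced for illustration only.

```python
from typing import Callable, List

# Hypothetical interface: a solver maps (problem, abstraction, sample_index)
# to an answer string. In practice this would be a sampled model rollout.
Solver = Callable[[str, str, int], str]


def verifiable_reward(answer: str, reference: str) -> float:
    """Assumed verifier: 1.0 on exact match with the reference, else 0.0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def solution_rewards(problem: str, abstraction: str, reference: str,
                     solver: Solver, n_samples: int) -> List[float]:
    """Inner loop: score each solution sampled under one abstraction."""
    return [
        verifiable_reward(solver(problem, abstraction, i), reference)
        for i in range(n_samples)
    ]


def abstraction_reward(problem: str, abstraction: str, reference: str,
                       solver: Solver, n_samples: int) -> float:
    """Outer reward: success rate of the solver given this abstraction."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    rewards = solution_rewards(problem, abstraction, reference, solver, n_samples)
    return sum(rewards) / len(rewards)


def toy_solver(problem: str, abstraction: str, sample_index: int) -> str:
    """Deterministic stand-in for the Solution Generator (illustration only).

    An abstraction mentioning "factor" succeeds on 3 of every 4 samples;
    any other abstraction succeeds on 1 of every 4.
    """
    if "factor" in abstraction:
        return "42" if sample_index % 4 != 3 else "0"
    return "42" if sample_index % 4 == 0 else "0"


helpful = abstraction_reward("p", "factor the expression", "42", toy_solver, 4)
unhelpful = abstraction_reward("p", "guess and check", "42", toy_solver, 4)
```

Under this reading, the per-sample values from `solution_rewards` would feed the Solution Generator's update, and their mean would feed the Abstraction Generator's update, which is why the scheme looks like GRPO split into two steps.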
The paper is well written, but the summarization trick is rather incremental.