Simulates a real software engineering environment (code modifications, builds, tests, etc.) inside an RL environment, where the agent solves problems using tools such as a terminal and a file editor
Hybrid Verifier
Combines an execution-free approach (an LLM evaluates patches) with an execution-based approach (tests are actually run)
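A minimal sketch of how the two signals could be combined to pick one patch from several candidates. The `Candidate` fields, the weights, and the simple weighted sum are all assumptions made for illustration; the notes above do not specify the actual combination rule.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    patch_id: str
    llm_score: float   # execution-free: LLM's judgment of the patch, assumed in [0, 1]
    tests_passed: int  # execution-based: number of tests that passed
    tests_total: int   # execution-based: number of tests that were run


def hybrid_score(c: Candidate, w_exec: float = 1.0, w_llm: float = 1.0) -> float:
    """Weighted sum of test pass rate and LLM score (hypothetical rule)."""
    exec_score = c.tests_passed / c.tests_total if c.tests_total else 0.0
    return w_exec * exec_score + w_llm * c.llm_score


def select_best(candidates: list[Candidate]) -> Candidate:
    """Return the candidate patch with the highest hybrid score."""
    return max(candidates, key=hybrid_score)


if __name__ == "__main__":
    pool = [
        Candidate("patch-a", llm_score=0.9, tests_passed=1, tests_total=4),
        Candidate("patch-b", llm_score=0.6, tests_passed=4, tests_total=4),
    ]
    print(select_best(pool).patch_id)
```

Here the LLM prefers patch-a, but patch-b passes every test and ends up with the higher combined score, which is the kind of disagreement a hybrid verifier is meant to resolve.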
GRPO++ (used to train DeepSWE, Together AI)
Drawing ideas from DAPO, Dr.GRPO, LOOP/RLOO and others, the following improvements were added:
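- Clip High (DAPO): Raise the upper clipping bound of the surrogate loss to encourage exploration
- No KL Loss (DAPO): Drop the KL penalty term from the objective
- No Reward Std, Length Normalization (Dr.GRPO): Do not divide advantages by the reward standard deviation; normalize the loss by a constant length instead of each response's own length
- Leave One Out (LOOP/RLOO): Reduce variance by excluding each sample's own reward from the baseline used to estimate its advantage
- Compact Filtering: Mask the loss for trajectories that hit the max context length, max steps, or a timeout
- No Entropy Loss: Remove the entropy loss term to prevent training instability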
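The modifications listed above can be sketched together in a few lines of plain Python. This is an illustrative sketch, not the DeepSWE implementation: the function names, the clip values `eps_low=0.2` and `eps_high=0.28`, the choice to normalize by `group_size * max_len`, and the choice to keep masked trajectories' rewards in the baseline are all assumptions.

```python
def leave_one_out_advantages(rewards):
    """Advantage = own reward minus the mean reward of the OTHER samples.

    No division by the reward standard deviation (No Reward Std).
    """
    n = len(rewards)
    if n < 2:
        raise ValueError("need at least two samples per group")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped term with a larger upper bound (Clip High)."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


def grpo_pp_loss(token_ratios, rewards, truncated, max_len,
                 eps_low=0.2, eps_high=0.28):
    """Loss for one group of trajectories sampled from the same prompt.

    token_ratios: per-trajectory lists of per-token importance ratios
    rewards:      one scalar reward per trajectory
    truncated:    True if the trajectory hit max context/steps/timeout
    max_len:      constant used for length normalization (assumed here)
    """
    advantages = leave_one_out_advantages(rewards)
    total = 0.0
    for ratios, adv, is_truncated in zip(token_ratios, advantages, truncated):
        if is_truncated:
            continue  # Compact Filtering: masked trajectory adds no loss
        for rho in ratios:
            total += clipped_surrogate(rho, adv, eps_low, eps_high)
    # No KL term and no entropy term are added to the objective.
    # Dividing by a constant avoids a per-response length bias.
    return -total / (len(rewards) * max_len)


if __name__ == "__main__":
    loss = grpo_pp_loss(
        token_ratios=[[1.0, 1.0], [1.0]],
        rewards=[1.0, 0.0],
        truncated=[False, True],
        max_len=4,
    )
    print(loss)
```

With a positive advantage, a ratio above `1 + eps_high` is clipped, so the wider upper bound lets low-probability tokens gain more probability mass per update before clipping applies.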