SWE-bench

Creator
Seonglae Cho
Created
2025 Mar 2 18:29
Edited
2026 Apr 17 17:49
Refs
OpenAI has stopped publishing evaluation reports for the SWE-bench Verified benchmark and recommends moving to SWE-bench Pro. An audit found test-design flaws in 59.4% of the failed tasks: 35.5% used tests so narrow that they accepted only one specific implementation approach, and 18.8% included tests that required behavior not stated in the problem description. The audit also identified training-data contamination, with frontier models reproducing the gold patches verbatim from public repositories.
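The verbatim-reproduction finding suggests a simple screen anyone can run over their own evaluation logs: normalize away diff bookkeeping and whitespace, then check for exact matches against the gold patch. Below is a minimal sketch of that idea, assuming each prediction and its gold patch are available as unified-diff strings; the function names are hypothetical and not taken from OpenAI's audit tooling.

```python
import re

def normalize_patch(patch: str) -> str:
    """Drop diff bookkeeping lines and collapse whitespace so trivially
    reformatted copies of a patch still compare equal."""
    kept = []
    for line in patch.splitlines():
        # Skip file metadata and hunk headers; they carry no code content.
        if line.startswith(("diff ", "index ", "--- ", "+++ ", "@@")):
            continue
        kept.append(re.sub(r"\s+", " ", line).strip())
    return "\n".join(l for l in kept if l)

def is_verbatim_reproduction(model_patch: str, gold_patch: str) -> bool:
    """Flag likely contamination: the model's patch equals the gold patch
    after normalization."""
    return normalize_patch(model_patch) == normalize_patch(gold_patch)

# Usage (hypothetical data): flag every task whose prediction copies the gold patch.
# flagged = [t for t in tasks if is_verbatim_reproduction(t["pred"], t["gold"])]
```

Exact-match screening only catches the verbatim cases the report describes; a model can launder a memorized patch with small renames, so a stricter audit would also compare against the repository's merged commit history.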
Why SWE-bench Verified no longer measures frontier coding capabilities
SWE-bench Verified is increasingly contaminated and mismeasures frontier coding progress. Our analysis shows flawed tests and training leakage. We recommend SWE-bench Pro.
SWE-bench
SWE-bench: Evaluate Language Models on Open Source Software Tasks

Recommendations