OpenAI has stopped publishing evaluation reports for the SWE-bench Verified benchmark and now recommends moving to SWE-bench Pro. An audit found test-design flaws in 59.4% of the failed tasks: 35.5% used overly narrow tests that accepted only one specific implementation approach, and 18.8% included tests that required behavior not stated in the problem description. The audit also identified training-data contamination, with frontier models reproducing the gold patches verbatim from public repositories.
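The two flaw categories are easiest to see in miniature. Below is a hypothetical pytest sketch; the task, `parse_port`, and both tests are invented for illustration and are not taken from any real SWE-bench task. The first test pins an exact error message, so only one wording of a correct patch can pass; the second demands hexadecimal parsing that the issue never mentioned.

```python
# Hypothetical illustration of both flaw categories; parse_port, the
# "issue", and these tests are invented and do not come from SWE-bench.
import pytest


def parse_port(value: str) -> int:
    """A faithful patch for the stated issue: reject ports outside 0-65535."""
    port = int(value)
    if not 0 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port


def test_overly_narrow():
    # Flaw 1 (35.5% of failed tasks): pinning the exact message text means
    # only one implementation passes. A correct patch that instead raises
    # ValueError("invalid port 70000") fails despite identical behavior.
    with pytest.raises(ValueError, match=r"^port out of range: 70000$"):
        parse_port("70000")


def test_requires_unstated_behavior():
    # Flaw 2 (18.8%): the issue only asks for range validation, but this
    # test also expects hexadecimal input to parse. The faithful patch
    # above fails here through no fault of its own.
    assert parse_port("0x1F50") == 8016
```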
Why SWE-bench Verified no longer measures frontier coding capabilities
SWE-bench Verified is increasingly contaminated and mismeasures frontier coding progress. Our analysis shows flawed tests and training leakage. We recommend SWE-bench Pro.
https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/
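On the contamination claim, a crude way to flag verbatim reproduction is to compare a model's patch against the gold patch with a string-similarity ratio. A minimal sketch, assuming both patches are unified-diff strings; the threshold is an arbitrary illustrative choice, not anything from OpenAI's methodology:

```python
import difflib


def looks_memorized(model_patch: str, gold_patch: str,
                    threshold: float = 0.98) -> bool:
    """Flag near-verbatim reproduction of the gold patch.

    A crude screen only: high similarity suggests leakage, but short or
    heavily constrained patches can legitimately converge on the same diff.
    """
    ratio = difflib.SequenceMatcher(
        None, model_patch.strip(), gold_patch.strip()
    ).ratio()
    return ratio >= threshold


# Example: an identical patch scores 1.0 and is flagged.
gold = "--- a/app.py\n+++ b/app.py\n@@ -1 +1 @@\n-x = 1\n+x = 2\n"
print(looks_memorized(gold, gold))  # True
```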

SWE-bench
SWE-bench: Evaluate Language Models on Open Source Software Tasks
https://www.swebench.com/

Seonglae Cho