AI Coding Benchmark

Creator
Seonglae Cho
Created
2023 Nov 3 9:08
Edited
2026 Feb 19 19:55
AI Coding Benchmarks
Benchmarks should explicitly separate "guaranteed resource allocation" from "hard termination limits"
Scores on agent-based coding benchmarks (SWE-bench, Terminal-Bench, etc.) reflect not only model capability but also infrastructure settings (memory, CPU, time limits), and the resulting noise can exceed the score differences between models.
Quantifying infrastructure noise in agentic coding evals
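The separation above can be made concrete in a task runner: the harness grants each task a guaranteed resource budget and, independently, enforces a hard termination limit, so an infrastructure-induced failure is labeled as such instead of being scored as a model failure. A minimal sketch (the budget values and the `run_task` helper are illustrative assumptions, not any benchmark's actual API):

```python
import resource
import subprocess

# Hypothetical knobs, chosen for illustration only.
GUARANTEED_MEMORY_BYTES = 2 * 1024**3   # allocation the task may rely on
HARD_TIMEOUT_SECONDS = 300              # termination limit, reported separately

def run_task(cmd: list[str]) -> dict:
    """Run one benchmark task, tagging infrastructure-caused failures
    separately from the task's own exit status."""
    def limit_memory():
        # Cap address space at the guaranteed allocation so an overshoot
        # fails loudly in this task instead of starving concurrent tasks.
        resource.setrlimit(resource.RLIMIT_AS,
                           (GUARANTEED_MEMORY_BYTES, GUARANTEED_MEMORY_BYTES))

    try:
        proc = subprocess.run(cmd, preexec_fn=limit_memory,
                              timeout=HARD_TIMEOUT_SECONDS,
                              capture_output=True, text=True)
        return {"status": "finished", "exit_code": proc.returncode}
    except subprocess.TimeoutExpired:
        # Hit the hard limit: an infrastructure outcome, not a code verdict.
        return {"status": "infra_timeout", "exit_code": None}

print(run_task(["true"]))
```

Keeping the two limits as separate, reported fields is what lets later analysis quantify how much score variance the infrastructure itself contributes.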

What we need to measure in practice

  • Error Count & Clarity
  • Response Time for build, test, and deployment
  • Ecosystem Stability (count of dependency conflicts and documentation/API mismatches)
  • Abstraction Complexity (module coupling, average LOC per function, cyclomatic complexity)
  • Dev-Environment Reliability (ability to distinguish setup failures from code failures)
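The "Abstraction Complexity" item above is straightforward to automate. As a sketch, average LOC per function can be computed from the AST of generated code (`avg_loc_per_function` is a hypothetical helper, not an existing tool):

```python
import ast

def avg_loc_per_function(source: str) -> float:
    """Average lines of code per function definition in a source string;
    a rough proxy for one of the abstraction-complexity signals."""
    tree = ast.parse(source)
    lengths = [
        node.end_lineno - node.lineno + 1
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    return sum(lengths) / len(lengths) if lengths else 0.0

sample = """
def f(x):
    return x + 1

def g(x):
    y = x * 2
    return y
"""
print(avg_loc_per_function(sample))  # → 2.5
```

Module coupling and cyclomatic complexity can be gathered the same way, by counting imports and branch nodes per function over the same AST walk.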
We Can Just Measure Things
Using programming agents to measure developer productivity.
Backlinks

Math AI

Recommendations