AI Coding Benchmarks
Benchmarks should explicitly separate "guaranteed resource allocation" from "hard termination limits"
Agent-based coding benchmark scores (SWE-bench, Terminal-Bench, etc.) reflect not only model performance but also infrastructure settings (memory, CPU, time limits, etc.), and the infrastructure effect can be larger than the score gap between models.
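The soft/hard distinction above can be sketched with POSIX resource limits: the soft limit plays the role of the "guaranteed" budget a run is expected to stay within (exceeding it raises SIGXCPU, which the process can still handle), while the hard limit is the point at which the kernel kills the process outright. The limit values and function name here are illustrative assumptions, not settings from the Anthropic article.

```python
import resource

# Assumed budgets for illustration only: the benchmark "guarantees" the
# soft budget; the hard value is the termination ceiling.
SOFT_CPU_SECONDS = 600
HARD_CPU_SECONDS = 900

def apply_cpu_limits(soft: int, hard: int) -> tuple[int, int]:
    """Set soft/hard CPU-time limits for the current process.

    Exceeding the soft limit delivers SIGXCPU (the process can still
    clean up and report); exceeding the hard limit kills the process.
    Returns the (soft, hard) values now in effect.
    """
    resource.setrlimit(resource.RLIMIT_CPU, (soft, hard))
    return resource.getrlimit(resource.RLIMIT_CPU)

soft, hard = apply_cpu_limits(SOFT_CPU_SECONDS, HARD_CPU_SECONDS)
print(soft, hard)
```

Reporting both values per run (rather than a single opaque "timeout") is what makes the two kinds of limit auditable across benchmark harnesses.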
Quantifying infrastructure noise in agentic coding evals
https://www.anthropic.com/engineering/infrastructure-noise

What we need to measure in practice
- Test Code Coverage & Success Rate
- Error Count & Clarity
- Response Time for build, test, and deployment
- Ecosystem Stability (count of dependency conflicts and documentation/API mismatches)
- Abstraction Complexity (module coupling, average LOC per function, cyclomatic complexity)
- Dev-Environment Reliability (ability to distinguish setup vs. code failures)
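Two of the abstraction-complexity metrics above can be sketched with the standard-library `ast` module: average LOC per function and a simple cyclomatic-complexity estimate (1 plus the number of branching nodes). This is a rough sketch, not a substitute for a dedicated tool like radon; the set of nodes counted as branches is an assumption.

```python
import ast

def function_metrics(source: str) -> dict:
    """For each function in the given Python source, compute its LOC
    and a rough cyclomatic complexity (1 + branching nodes)."""
    tree = ast.parse(source)
    # Assumed branch set: conditionals, loops, try blocks, boolean ops.
    branch_nodes = (ast.If, ast.For, ast.While, ast.Try, ast.BoolOp)
    metrics = {}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            loc = node.end_lineno - node.lineno + 1
            complexity = 1 + sum(
                isinstance(child, branch_nodes)
                for child in ast.walk(node))
            metrics[node.name] = {"loc": loc, "complexity": complexity}
    return metrics

sample = '''
def f(x):
    if x > 0:
        return x
    return -x
'''
print(function_metrics(sample))
```

Module coupling would need import-graph analysis on top of this, but the same `ast`-walking pattern applies.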
We Can Just Measure Things
Using programming agents to measure developer productivity.
https://lucumr.pocoo.org/2025/6/17/measuring/

Seonglae Cho