Computer Use Benchmarks: OSWorld, Windows Agent Arena, Android World

Browser Use Benchmarks: WebVoyager, WebArena, Online Mind2Web

AI Reasoning Benchmarks: METR, GPQA

GUI Grounding Benchmarks: ScreenSpot V2, ScreenSpot Pro

Agent Benchmarks: Cost-controlled evaluations and joint optimization of accuracy and cost

AI Agents That Matter
AI agents are an exciting new research direction, and agent development is driven by benchmarks. Our analysis of current agent benchmarks and evaluation practices reveals several shortcomings that...
https://arxiv.org/abs/2407.01502

Stagehand GUI grounding