AI Agent Benchmark

Creator

Creator

Seonglae Cho

Created

Created

2024 Feb 27 8:4

Editor

Editor

Seonglae Cho

Edited

Edited

2026 Feb 18 17:20

Refs

Refs

AI Game Generation

Define step -> Collect data -> Make evals

End-to-End is not just a sum of components of eval pipeline

Hard to obey for Good benchmark principles

Computer Use Benchmarks

Windows Agent Arena

Browser Use Benchmarks

Oneline Mind2Web

AI Agent Reasoning Benchmarks

GPQA

Tool learning Benchmarks

GUI grounding Benchmarks

Agent Benchmarks

Cost-controlled evaluations and joint optimization of accuracy and cost

AI Agents That Matter

AI agents are an exciting new research direction, and agent development is driven by benchmarks. Our analysis of current agent benchmarks and evaluation practices reveals several shortcomings that...

https://arxiv.org/abs/2407.01502

Stagehand GUI grounding

Visual Grounding

OS-Harm benchmark

OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents

Computer use agents are LLM-based agents that can directly interact with a graphical user interface, by processing screenshots or accessibility trees. While these systems are gaining popularity,...

https://arxiv.org/abs/2506.14866

BrowseSafe: AI browser-specific 'real-time malicious prompt detection model'

Building Safer AI Browsers with BrowseSafe

Explore Perplexity's blog for articles, announcements, product updates, and tips to optimize your experience. Stay informed and make the most of Perplexity.

https://www.perplexity.ai/hub/blog/building-safer-ai-browsers-with-browsesafe

Building Safer AI Browsers with BrowseSafe

Backlinks

AI Object AI Agent Generative Model AI Agent

Recommendations

//////