AI Agent Benchmark


Eval pipeline: define steps -> collect data -> build evals

  • End-to-end performance is not just the sum of the eval pipeline's components, so step-level and end-to-end scores should be reported separately (see the sketch below)
  • Agent benchmarks find it hard to follow good-benchmark design principles
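
A minimal sketch of that loop, assuming a hypothetical `agent` callable that returns a step trajectory plus a final outcome. It scores the defined reference steps and the end-to-end result independently, since the latter is not recoverable from the former:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    prompt: str
    expected_steps: list[str]                 # defined reference trajectory
    success_check: Callable[[object], bool]  # end-to-end outcome check

def evaluate(agent: Callable, tasks: list[Task]) -> dict[str, float]:
    step_hits = step_total = e2e_hits = 0
    for task in tasks:
        trajectory, outcome = agent(task.prompt)  # collect data: steps + result
        # Component-level score: fraction of reference steps reproduced.
        step_total += len(task.expected_steps)
        step_hits += sum(step in trajectory for step in task.expected_steps)
        # End-to-end score: did the final outcome actually succeed?
        e2e_hits += task.success_check(outcome)
    return {
        "step_accuracy": step_hits / max(step_total, 1),
        "end_to_end_success": e2e_hits / len(tasks),
    }
```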
Computer Use Benchmarks
Browser Use Benchmarks
AI Agent Reasoning Benchmarks
Tool Learning Benchmarks
GUI Grounding Benchmarks

Agent Benchmarks

Cost-controlled evaluations and joint optimization of accuracy and cost
AI Agents That Matter
AI agents are an exciting new research direction, and agent development is driven by benchmarks. Our analysis of current agent benchmarks and evaluation practices reveals several shortcomings that...
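
A hedged sketch of the cost-controlled view the paper recommends: treat each agent as a (cost, accuracy) point and keep only the Pareto-optimal ones, rather than ranking by accuracy alone. Agent names and numbers here are made up for illustration.

```python
def pareto_frontier(results: dict[str, tuple[float, float]]) -> list[str]:
    """results maps agent name -> (cost_usd, accuracy). Returns agents that
    no other agent dominates (i.e. is both cheaper and at least as accurate)."""
    frontier = []
    for name, (cost, acc) in results.items():
        dominated = any(
            c <= cost and a >= acc and (c, a) != (cost, acc)
            for c, a in results.values()
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Illustrative numbers only: the expensive scaffold is dominated here.
print(pareto_frontier({
    "gpt-4o+reflexion": (2.40, 0.77),
    "gpt-4o":           (0.90, 0.78),
    "gpt-4o-mini":      (0.08, 0.61),
}))  # -> ['gpt-4o', 'gpt-4o-mini']
```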
Stagehand GUI grounding (visual grounding)
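
For GUI grounding, the common ScreenSpot-style metric counts a prediction as correct when the predicted click point lands inside the target element's bounding box. A minimal sketch, with illustrative data layouts:

```python
def click_in_bbox(pred_xy: tuple[float, float],
                  bbox: tuple[float, float, float, float]) -> bool:
    """bbox = (x_min, y_min, x_max, y_max), same pixel space as pred_xy."""
    x, y = pred_xy
    x0, y0, x1, y1 = bbox
    return x0 <= x <= x1 and y0 <= y <= y1

def grounding_accuracy(preds: list[tuple[float, float]],
                       targets: list[tuple[float, float, float, float]]) -> float:
    hits = sum(click_in_bbox(p, t) for p, t in zip(preds, targets))
    return hits / len(targets)
```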

OS-Harm benchmark

OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents
Computer use agents are LLM-based agents that can directly interact with a graphical user interface, by processing screenshots or accessibility trees. While these systems are gaining popularity,...
BrowseSafe: an AI-browser-specific, real-time malicious prompt detection model
Building Safer AI Browsers with BrowseSafe
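
A hedged sketch of where such a detector sits in an AI browser: score page-derived text for injected instructions before the agent acts on it. The `detector` callable and the threshold are hypothetical stand-ins, not BrowseSafe's actual interface.

```python
def guard_page_content(detector, page_text: str, threshold: float = 0.5) -> str:
    """Return page text for the agent, or raise if an injection is likely.
    `detector` is assumed to map text -> probability of a malicious prompt."""
    score = detector(page_text)
    if score >= threshold:
        raise RuntimeError(f"Blocked suspected prompt injection (score={score:.2f})")
    return page_text
```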

Recommendations