Define step -> Collect data -> Make evals
- End-to-End is not just a sum of components of eval pipeline
- Hard to obey for Good benchmark principles
Computer Use Benchmarks
Browser Use Benchmarks
AI Agent Reasoning Benchmarks
GUI grounding Benchmarks
Agent Benchmarks
Cost-controlled evaluations and joint optimization of accuracy and cost
AI Agents That Matter
AI agents are an exciting new research direction, and agent development is driven by benchmarks. Our analysis of current agent benchmarks and evaluation practices reveals several shortcomings that...
https://arxiv.org/abs/2407.01502

Stagehand GUI grounding Visual Grounding
OS-Harm benchmark
OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents
Computer use agents are LLM-based agents that can directly interact with a graphical user interface, by processing screenshots or accessibility trees. While these systems are gaining popularity,...
https://arxiv.org/abs/2506.14866

BrowseSafe: AI browser-specific 'real-time malicious prompt detection model'
Building Safer AI Browsers with BrowseSafe
Explore Perplexity's blog for articles, announcements, product updates, and tips to optimize your experience. Stay informed and make the most of Perplexity.
https://www.perplexity.ai/hub/blog/building-safer-ai-browsers-with-browsesafe


Seonglae Cho