Agent Monitoring

Agent Monitoring Services

"Reliable Weak-to-Strong Monitoring of LLM Agents" proposes a monitor red teaming (MRT) framework that covers the situational-awareness levels of both the agent and the monitor, evasion strategies such as prompt injection, and environments for tool use and computer use. In particular, it introduces a hybrid monitor scaffolding that combines hierarchical summarization with sequential inspection.
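To make the hybrid idea concrete, here is a minimal sketch of one way a hierarchical pass and a sequential pass over an agent trajectory could be combined. Everything in it is an assumption for illustration: the keyword-based `keyword_scorer` stands in for a call to a monitor LLM, and the chunk size, the mean/max aggregation rules, and the weighted average in `hybrid_score` are hypothetical choices, not the paper's implementation.

```python
from typing import Callable, List, Sequence

# Hypothetical scorer type: maps text to a suspicion score in [0, 1].
Scorer = Callable[[str], float]


def keyword_scorer(text: str) -> float:
    """Stub monitor (assumption): counts flagged keywords instead of calling an LLM."""
    flagged = ("password", "exfiltrate", "api_key")
    hits = sum(word in text.lower() for word in flagged)
    return min(1.0, hits / len(flagged))


def chunk(items: Sequence, size: int) -> List[list]:
    """Split a sequence into consecutive groups of at most `size` items."""
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def hierarchical_score(steps: List[str], scorer: Scorer, size: int = 2) -> float:
    """Score chunks independently, then merge scores level by level (mean per group)."""
    if not steps:
        return 0.0
    scores = [scorer(" ".join(c)) for c in chunk(steps, size)]
    while len(scores) > 1:
        scores = [sum(group) / len(group) for group in chunk(scores, size)]
    return scores[0]


def sequential_score(steps: List[str], scorer: Scorer, size: int = 2) -> float:
    """Inspect chunks in order, carrying the previous chunk as context; keep the max."""
    context, worst = "", 0.0
    for c in chunk(steps, size):
        text = " ".join(c)
        worst = max(worst, scorer(context + " " + text))
        context = text  # stand-in for a running summary of earlier steps
    return worst


def hybrid_score(steps: List[str], scorer: Scorer, weight: float = 0.5, size: int = 2) -> float:
    """Weighted average of the two passes (the combination rule is an assumption)."""
    return (weight * hierarchical_score(steps, scorer, size)
            + (1.0 - weight) * sequential_score(steps, scorer, size))
```

The sequential pass can catch misbehavior that spans adjacent chunks because it carries context forward, while the hierarchical pass scores each chunk in isolation and then aggregates; averaging the two is one simple way to let each cover the other's blind spots.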

The paper also reports a "weak-to-strong" scaling effect, in which a weak model can effectively monitor a strong agent. Experiments find that agent awareness (whether the agent knows it is being monitored) is the dominant factor and substantially reduces the monitor’s reliability.

Reliable Weak-to-Strong Monitoring for LLMs

Explore how weak LLM monitors can reliably oversee stronger LLM agents, with findings on scalable oversight and safety in agentic AI systems.

https://labs.scale.com/papers/mrt

Reliable Weak-to-Strong Monitoring of LLM Agents

We stress test monitoring systems for detecting covert misbehavior in autonomous LLM agents (e.g., secretly sharing private information). To this end, we systematize a monitor red teaming (MRT)...

https://arxiv.org/abs/2508.19461

Monitor Red Teaming (MRT) benchmark

The benchmark systematically stress-tests how reliably monitoring systems detect covert misbehavior, such as leaking sensitive information or sabotaging the system.
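One common way to put a number on monitor reliability is the true positive rate at a fixed false positive rate: how many misbehaving trajectories the monitor flags when it is only allowed to raise a small fraction of false alarms on benign ones. The sketch below assumes each trajectory already has a suspicion score and a ground-truth label; the function name `tpr_at_fpr`, its parameters, and the choice of this metric are illustrative assumptions rather than something stated in this page.

```python
from typing import List


def tpr_at_fpr(benign: List[float], attack: List[float], max_fpr: float = 0.01) -> float:
    """Fraction of attack trajectories flagged at a threshold whose false
    positive rate on benign trajectories does not exceed `max_fpr`.

    `benign` and `attack` hold monitor suspicion scores (higher = more suspicious).
    """
    if not benign or not attack:
        raise ValueError("need at least one benign and one attack score")
    # Number of benign trajectories we may flag without exceeding max_fpr.
    allowed = int(max_fpr * len(benign))
    ranked = sorted(benign, reverse=True)
    # Flag scores strictly above the (allowed + 1)-th highest benign score;
    # ties make the threshold conservative (fewer false positives than allowed).
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    return sum(score > threshold for score in attack) / len(attack)
```

Fixing the false positive budget first reflects how a monitor would be deployed: a threshold that flags many benign runs is unusable in practice, so detection is compared only at thresholds that keep false alarms rare.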

Reliable Weak-to-Strong Monitoring of LLM Agents

https://arxiv.org/abs/2508.19461

SHADE Arena

https://arxiv.org/pdf/2506.15740
