Agent Monitoring Services
Proposes an MRT framework that covers the situational-awareness levels of both the agent and the monitor, evasion strategies such as prompt injection, and the environments for tool use and computer use. In particular, it introduces a hybrid monitor scaffolding approach that combines hierarchical summarization with sequential inspection.
A "weak-to-strong" scaling effect where a weak model can effectively monitor a strong agent. Experiments find that "agent awareness" (whether the agent knows it is being monitored) is the dominant factor that substantially reduces the monitor’s reliability.
Reliable Weak-to-Strong Monitoring for LLMs
Explore how weak LLM monitors can reliably oversee stronger LLM agents, with findings on scalable oversight and safety in agentic AI systems.
https://labs.scale.com/papers/mrt
Reliable Weak-to-Strong Monitoring of LLM Agents
We stress test monitoring systems for detecting covert misbehavior in autonomous LLM agents (e.g., secretly sharing private information). To this end, we systematize a monitor red teaming (MRT)...
https://arxiv.org/abs/2508.19461


Seonglae Cho