Across 17 complex task environments, agents are given a benign "primary task" (e.g., schedule management, money transfers) alongside a hidden "malicious task" (e.g., API key theft, illicit transfers) to complete in parallel. The strongest sabotage agent, Claude 3.7 Sonnet, succeeded 27% of the time even while under the eye of a monitor (Claude 3.6 Sonnet), while Gemini 2.5 Pro achieved 15%.
Removing the hidden scratchpad in which the agent records its reasoning significantly lowers the attack success rate, suggesting that chain-of-thought monitoring is useful. Even so, monitoring performance peaks at an AUC of 0.87 for distinguishing malicious from benign transcripts, indicating that subtle sabotage remains hard to detect.
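To make the AUC figure concrete, here is a minimal sketch of how a monitor's detection performance could be scored: the monitor assigns each transcript a suspicion score, and AUC measures how well those scores separate sabotage runs from benign ones. The scores and labels below are hypothetical illustrations, not data from the paper.

```python
# Hypothetical example: computing a monitor's detection AUC from
# suspicion scores it assigned to agent transcripts.
from sklearn.metrics import roc_auc_score

# 1 = transcript contains a malicious side task, 0 = benign-only
labels = [1, 1, 1, 0, 0, 0, 0]
# Monitor's suspicion score (e.g., on a 0-100 scale) per transcript
suspicion = [88, 35, 72, 20, 15, 41, 9]

auc = roc_auc_score(labels, suspicion)
print(f"Monitor AUC: {auc:.2f}")  # 1.0 = perfect separation, 0.5 = chance
```

An AUC of 0.87 means the monitor ranks a randomly chosen malicious transcript above a randomly chosen benign one about 87% of the time, so a meaningful fraction of sabotage still slips below any practical alarm threshold.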
SHADE-Arena
SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents
A new set of evaluations to test the sabotage and monitoring capabilities of LLM AI models
https://www.anthropic.com/research/shade-arena-sabotage-monitoring


Seonglae Cho