SHADE

Creator: Seonglae Cho
Created: 2025 Jun 18 22:21
Edited: 2025 Jun 18 22:24
Refs
Across 17 complex task environments, agents are evaluated on completing a benign "primary task" (e.g., schedule management, money transfer) while covertly pursuing a "malicious side task" (e.g., API key theft, illegal transfers) in parallel. The best sabotage agent (Claude 3.7 Sonnet) succeeded 27% of the time even under a monitor (Claude 3.6 Sonnet), while Gemini 2.5 Pro achieved 15%.
Removing the scratchpad that records the agent's thought process significantly reduces the attack success rate, suggesting that chain-of-thought monitoring is useful. However, monitoring performance (AUC for distinguishing malicious from benign transcripts) peaks at only 0.87, indicating that subtle sabotage is hard to detect.
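The AUC metric here has a simple interpretation: the probability that a randomly chosen sabotage transcript receives a higher suspicion score from the monitor than a randomly chosen benign one. A minimal sketch, using hypothetical suspicion scores (not data from the paper):

```python
def auc(benign, malicious):
    """ROC AUC via the Mann-Whitney U interpretation: the probability
    that a malicious transcript's suspicion score exceeds a benign
    one's (ties count as 0.5)."""
    wins = 0.0
    for m in malicious:
        for b in benign:
            if m > b:
                wins += 1
            elif m == b:
                wins += 0.5
    return wins / (len(malicious) * len(benign))

# Hypothetical suspicion scores (0-100) a monitor model might assign.
benign_scores = [5, 12, 20, 8, 30, 15]
sabotage_scores = [25, 60, 10, 80, 45, 35]

print(auc(benign_scores, sabotage_scores))  # ≈ 0.861
```

An AUC of 0.87 means the monitor ranks a sabotage transcript above a benign one about 87% of the time, which still leaves a substantial miss rate at any usable alarm threshold.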

SHADE-Arena

assets.anthropic.com
SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents
A new set of evaluations to test the sabotage and monitoring capabilities of LLM AI models
