SHADE

Creator
Seonglae Cho
Created
2025 Jun 18 22:21
Edited
2025 Jun 18 22:24
Refs
Across 17 complex task environments, agents are tested on completing a "primary task" (e.g., schedule management, money transfer) while carrying out a "malicious task" (e.g., API key theft, illegal transfers) in parallel. The strongest sabotage agent, Claude 3.7 Sonnet, achieved a 27% success rate even while observed by the monitor (Claude 3.6 Sonnet); Gemini 2.5 Pro achieved 15%.
Removing the scratchpad in which the agent records its thought process significantly reduces the attack success rate, which suggests that "chain-of-thought" monitoring is useful. In addition, monitoring performance, measured as the AUC for distinguishing malicious from normal conversations, peaks at only 0.87, which indicates how hard subtle sabotage is to detect.
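To make the AUC figure concrete, here is a minimal sketch of how such a score can be computed from monitor outputs. It assumes the monitor assigns each conversation a numeric suspicion score; the function name `roc_auc` and the sample scores are hypothetical illustrations, not data or code from SHADE-Arena. AUC is the probability that a randomly chosen malicious conversation receives a higher score than a randomly chosen normal one, with ties counted as half.

```python
def roc_auc(benign_scores, malicious_scores):
    """Pairwise AUC: fraction of (malicious, benign) pairs the monitor ranks correctly.

    Ties count as half a correct ranking.
    """
    wins = 0.0
    for m in malicious_scores:
        for b in benign_scores:
            if m > b:
                wins += 1.0
            elif m == b:
                wins += 0.5
    return wins / (len(malicious_scores) * len(benign_scores))


# Hypothetical suspicion scores on a 0-100 scale (illustrative only).
benign = [5, 10, 20, 35]
malicious = [15, 40, 60, 90]

print(roc_auc(benign, malicious))  # 0.875
```

An AUC of 0.5 means the monitor does no better than chance and 1.0 means perfect separation, so a ceiling of 0.87 leaves a meaningful share of malicious conversations scoring below some normal ones.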
 
 
 

SHADE-Arena

 
 
