AI Agent Attack

Creator
Seonglae Cho
Created
2024 Dec 20 23:38
Edited
2025 Jun 13 17:37
AI Agent Attacks

Multi-Agent Attacks

Prompt Infection
Because the activation space is polysemantic, changing an SAE (sparse autoencoder) feature or a token also shifts the probabilities of other tokens, so the changed feature acts as an interfering feature. In particular, this paper demonstrated that manipulating activations along a feature direction can increase the probability of a desired token.
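A minimal sketch of feature-direction manipulation on a toy model, assuming a random unembedding matrix with unit-norm rows. The names `W_U`, `h`, `steer`, and `alpha` are hypothetical and are not taken from the paper. It shows the target token's probability rising while the other tokens' probabilities shift as a side effect of overlapping directions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
d_model, vocab = 8, 20       # more tokens than dimensions, so token directions cannot all be orthogonal

# Hypothetical unembedding matrix; rows are normalized to unit length.
W_U = rng.normal(size=(vocab, d_model))
W_U /= np.linalg.norm(W_U, axis=1, keepdims=True)

h = rng.normal(size=d_model)  # hypothetical hidden activation

def steer(h, direction, alpha):
    """Add alpha times a unit feature direction to the activation."""
    return h + alpha * direction

target = 3
direction = W_U[target]       # unit-norm direction of the desired token

p_before = softmax(W_U @ h)
p_after = softmax(W_U @ steer(h, direction, alpha=4.0))

# Change in every non-target token probability caused by the same intervention.
interference = np.delete(p_after - p_before, target)

print("target prob before/after:", p_before[target], p_after[target])
print("largest side effect on another token:", np.abs(interference).max())
```

With unit-norm rows, the target logit gains exactly `alpha` while every other logit gains `alpha` times a cosine below 1, so the target probability must increase. The side effects on the other tokens are the toy analogue of interference.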
When highly polysemantic super-neurons are amplified (activation increased above 1), the model output changes significantly. When the same neurons are masked (activation reduced close to 0), the output barely changes. This asymmetry between amplification and masking is an asymmetric vulnerability.
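One way such an asymmetry could be measured is sketched below on a toy linear layer. This is an illustration under stated assumptions, not the experiment behind the claim above: `W_out`, `acts`, and `intervene` are hypothetical, baseline activations are assumed to be small (below 0.2), and the output change is measured as KL divergence from the unmodified output distribution.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    """KL divergence KL(p || q) for strictly positive distributions."""
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
n_neurons, vocab = 16, 10

# Hypothetical neuron-to-logit weights; each neuron writes to every token (polysemantic).
W_out = rng.normal(size=(vocab, n_neurons))
# Hypothetical baseline activations, assumed small.
acts = rng.uniform(0.0, 0.2, size=n_neurons)

def output(a):
    return softmax(W_out @ a)

def intervene(a, idx, value):
    """Return a copy of the activations with neuron idx set to value."""
    out = a.copy()
    out[idx] = value
    return out

neuron = 0
p_base = output(acts)
kl_masked = kl(p_base, output(intervene(acts, neuron, 0.0)))     # masking
kl_amplified = kl(p_base, output(intervene(acts, neuron, 5.0)))  # amplification above 1

print("KL after masking:      ", kl_masked)
print("KL after amplification:", kl_amplified)
```

In this toy setting the gap comes from simple geometry: masking can remove at most the neuron's small baseline contribution, while amplification can push the logits arbitrarily far. Whether the same mechanism explains the behavior in real models is an open assumption here.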
