AI Agent Attack

AI Agent Attacks

BrowserART

Indirect Prompt Injection

Multi AI Agent Attacks

Agent Replication Attack

Multi-Agent Attacks

https://arxiv.org/pdf/2402.08567

Multi-agent system Risk Mitigation by SAE Steering

LLM-Agent-SAE

Samsung • Updated 2025 Oct 4 15:47

Because the activation space is polysemantic, changing an SAE feature or a token also shifts the probabilities of other tokens, so the changed feature acts as an interfering feature. In particular, the paper demonstrates that manipulating a feature direction can raise the probability of a desired token.
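The idea of feature direction manipulation can be sketched on a toy model. This is a minimal illustration, not the paper's implementation: the unembedding matrix `UNEMBED`, the hidden state `HIDDEN`, the direction `FEATURE_DIRECTION`, and the steering strength `alpha` are all hypothetical values chosen by hand. The direction is added to the hidden state, and the token distribution is compared before and after.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logits_from_hidden(hidden, unembed):
    """Project a hidden state onto token logits (one unembed row per token)."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in unembed]

def steer(hidden, direction, alpha):
    """Add alpha times a feature direction to the hidden state."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Hypothetical toy values, not taken from the paper.
UNEMBED = [
    [1.0, 0.0, 0.5],   # token 0
    [0.0, 1.0, 0.5],   # token 1 (the desired token)
    [0.5, 0.5, -1.0],  # token 2
]
HIDDEN = [1.0, 0.2, 0.1]
# Stand-in for an SAE decoder direction; it overlaps with several token rows,
# which is a crude picture of a polysemantic feature.
FEATURE_DIRECTION = [0.0, 0.8, 0.6]

def token_probs(alpha):
    steered = steer(HIDDEN, FEATURE_DIRECTION, alpha)
    return softmax(logits_from_hidden(steered, UNEMBED))

before = token_probs(0.0)
after = token_probs(3.0)

for token, (p0, p1) in enumerate(zip(before, after)):
    print(f"token {token}: {p0:.3f} -> {p1:.3f}")
```

In this toy setup the desired token's probability rises, and the other tokens' probabilities move as well, because the steered direction is not aligned with a single token row. That side effect is the interference described above.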

Highly polysemantic super-neurons show an asymmetric vulnerability. When they are amplified (activation raised above 1), the model output changes significantly. When the same neurons are masked (activation reduced close to 0), the output barely changes.
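One way to quantify such a comparison is to patch a single activation and measure how far the output distribution moves, for example with KL divergence. The sketch below is only an assumption about how this could be measured. The toy matrix `UNEMBED`, the hidden state `HIDDEN`, the neuron index `NEURON`, and the patched values 4.0 and 0.0 are hypothetical and do not reproduce the paper's measurements.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def output_distribution(hidden, unembed):
    """Token distribution produced by a hidden state."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in unembed]
    return softmax(logits)

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def with_activation(hidden, index, value):
    """Return a copy of the hidden state with one activation overwritten."""
    patched = list(hidden)
    patched[index] = value
    return patched

# Hypothetical toy values, not taken from the paper.
UNEMBED = [
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.5],
    [0.5, 0.5, -1.0],
]
HIDDEN = [1.0, 0.2, 0.3]
NEURON = 2  # index standing in for a "super-neuron"

baseline = output_distribution(HIDDEN, UNEMBED)
amplified = output_distribution(with_activation(HIDDEN, NEURON, 4.0), UNEMBED)
masked = output_distribution(with_activation(HIDDEN, NEURON, 0.0), UNEMBED)

shift_amplified = kl_divergence(baseline, amplified)
shift_masked = kl_divergence(baseline, masked)

print(f"KL after amplifying: {shift_amplified:.4f}")
print(f"KL after masking:    {shift_masked:.4f}")
```

With these hand-picked numbers the amplified case shifts the output far more than the masked case. The toy only illustrates the measurement; whether a real model shows the same asymmetry has to be checked on its own activations.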

https://arxiv.org/pdf/2505.10670v1
