AI Agent Attacks
Multi-Agent Attacks
arxiv.org
https://arxiv.org/pdf/2402.08567
Prompt Infection
arxiv.org
https://arxiv.org/pdf/2410.07283
Multi-agent system Risk Mitigation by SAE Steering
LLM-Agent-SAE
Samsung • Updated 2025 Oct 4 15:47
Because the activation space is polysemantic, changing an SAE feature or a token also shifts the probabilities of other tokens, so the changed feature acts as an interfering feature. In particular, this paper demonstrated that manipulating a feature direction can increase the probability of a desired token.
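A minimal sketch of feature-direction steering and the interference it causes, on a toy model rather than the paper's setup. Everything here is an assumption for illustration: the identity unembedding `W_unembed`, the zero hidden state, the hypothetical `feature_direction` that writes mostly to token 0 and partly to token 1, and the steering strength `alpha`.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D logit vector.
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def steer(hidden, direction, alpha):
    # Add alpha times the unit-normalized feature direction to the hidden state.
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

# Toy assumptions: 4 tokens, 4 hidden dimensions, identity unembedding.
W_unembed = np.eye(4)
hidden = np.zeros(4)  # the base distribution is uniform

# Hypothetical polysemantic feature: mostly token 0, partly token 1.
feature_direction = np.array([1.0, 0.5, 0.0, 0.0])

p_base = softmax(W_unembed @ hidden)
p_steered = softmax(W_unembed @ steer(hidden, feature_direction, alpha=2.0))

print("base   :", np.round(p_base, 3))
print("steered:", np.round(p_steered, 3))
```

Steering raises the probability of token 0, and because softmax renormalizes over a shared direction, the probabilities of the other tokens move as well. That side effect is the interference described above.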
Highly polysemantic super-neurons show an asymmetric vulnerability: amplifying them (raising the activation above 1) changes the model output significantly, whereas masking the same neurons (reducing the activation to near 0) barely changes the output.
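The asymmetry can be illustrated with a toy linear readout. This is a sketch under stated assumptions, not the paper's experiment or its explanation: the base activation of 0.5, the amplified value of 5.0, the one-hot `direction`, and the use of total variation distance as the measure of output change are all chosen for illustration. In this toy, masking can remove at most the feature's current contribution, while amplification has no such bound.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D logit vector.
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def output_dist(activation, direction, rest):
    # Toy readout: logits are the remaining state plus activation * direction.
    return softmax(rest + activation * direction)

def total_variation(p, q):
    # Total variation distance between two discrete distributions.
    return 0.5 * np.abs(p - q).sum()

# Toy assumptions: one neuron writing to token 0 out of 4 tokens.
direction = np.array([1.0, 0.0, 0.0, 0.0])
rest = np.zeros(4)

base = output_dist(0.5, direction, rest)       # assumed normal activation
masked = output_dist(0.0, direction, rest)     # activation reduced to 0
amplified = output_dist(5.0, direction, rest)  # activation raised above 1

shift_mask = total_variation(base, masked)
shift_amp = total_variation(base, amplified)

print("output shift when masked   :", round(float(shift_mask), 3))
print("output shift when amplified:", round(float(shift_amp), 3))
```

In this sketch the amplified neuron moves the output distribution far more than the masked one. A real model is nonlinear, so the size of the gap there would have to be measured rather than assumed.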
arxiv.org
https://arxiv.org/pdf/2505.10670v1

Seonglae Cho