AI Agent Attacks
Multi-Agent Attacks
Prompt Infection
Multi-agent system Risk Mitigation by SAE Steering
LLM-Agent-SAE
Samsung • Updated 2025 May 21 8:19
Due to the polysemantic activation space, SAE feature or token changes interfere with other token probabilities, becoming interfering features. In particular, this paper demonstrated increasing desired token probabilities through feature direction manipulation.
In highly polysemantic super-neurons, when amplified (increasing activation above 1), the model output changes significantly. However, when the same neurons are masked (reducing activation close to 0), the output barely changes, showing an asymmetric vulnerability phenomenon.