Refusal Vector based Attack

Creator: Seonglae Cho
Created: 2025 Mar 13 16:32
Editor: Seonglae Cho
Edited: 2025 Mar 13 16:34
Refs
  • ReFAT
  • Subspace Rerouting
 
 
 

SAE refusal feature (SAE Feature)

Steering Language Model Refusal with Sparse Autoencoders
Responsible practices for deploying language models include guiding models to recognize and refuse answering prompts that are considered unsafe, while complying with safe prompts. Achieving such behavior typically requires updating model weights, which is costly and inflexible. We explore opportunities to steer model activations at inference time, which does not require updating weights. Using sparse autoencoders, we identify and steer features in Phi-3 Mini that mediate refusal behavior. We find that feature steering can improve Phi-3 Mini’s robustness to jailbreak attempts across various harms, including challenging multi-turn attacks. However, we discover that feature steering can adversely affect overall performance on benchmarks. These results suggest that identifying steerable mechanisms for refusal via sparse autoencoders is a promising approach for enhancing language model safety, but that more research is needed to mitigate feature steering’s adverse effects on performance.
https://arxiv.org/html/2411.11296v1
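
A rough illustration of what inference-time feature steering with a sparse autoencoder could look like. This is a toy sketch and not the paper's implementation. It assumes a ReLU SAE with one encoder row and one decoder direction per feature, and it assumes steering is done by clamping a single feature to a target value while leaving the rest of the activation untouched. All names (`sae_encode`, `clamp_feature`, `w_enc`, `w_dec`, `b_enc`) and the tiny 2-dimensional weights are hypothetical.

```python
from typing import List

Vector = List[float]


def relu(v: float) -> float:
    return v if v > 0.0 else 0.0


def sae_encode(x: Vector, w_enc: List[Vector], b_enc: Vector) -> Vector:
    """Feature activations f = ReLU(W_enc x + b_enc), one row of w_enc per feature."""
    return [
        relu(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_enc, b_enc)
    ]


def clamp_feature(
    x: Vector,
    w_enc: List[Vector],
    b_enc: Vector,
    w_dec: List[Vector],
    feature: int,
    value: float,
) -> Vector:
    """Shift x along one feature's decoder direction.

    The shift is (value - current activation) * decoder direction, so the
    feature's contribution to the activation becomes `value` and everything
    else in x (including SAE reconstruction error) is left as it was.
    """
    f = sae_encode(x, w_enc, b_enc)
    delta = value - f[feature]
    return [xi + delta * di for xi, di in zip(x, w_dec[feature])]


# Toy weights (assumed): 2-dim activation, 2 features, identity encoder/decoder.
w_enc = [[1.0, 0.0], [0.0, 1.0]]
w_dec = [[1.0, 0.0], [0.0, 1.0]]
b_enc = [0.0, 0.0]

x = [0.5, 2.0]  # stand-in for one token's residual-stream activation
steered = clamp_feature(x, w_enc, b_enc, w_dec, feature=0, value=3.0)
print(steered)
```

In a real model the steered vector would be written back into the forward pass at the layer the SAE was trained on. Raising a refusal feature this way corresponds to the defensive use described in the abstract, and the abstract's caveat about degraded benchmark performance applies to this kind of intervention.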
 
 

Copyright Seonglae Cho