Refusal Vector

Creator
Seonglae Cho
Created
2025 Jan 26 18:19
Edited
2025 Jun 18 10:57

Refusal Feature

Refusal Vector Usages
 
 
 
 
Like the Refusal Vector, misalignment is expressed as a linear direction in activation space, so it can be studied through rank-1 LoRA adapters. Emergent Misalignment (EM) converges to a single linear direction in activation space, much as refusal is mediated by a single direction. Furthermore, the direction extracted from one fine-tune was enough to suppress misalignment in fine-tunes trained on completely different datasets and with larger LoRA configurations. Using just a rank-1 LoRA adapter, the authors induced 11% EM while maintaining over 99% coherence.
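As a minimal sketch of why a rank-1 LoRA adapter can be read as a single direction, the toy example below uses NumPy with random weights. The names `rank1_lora_delta` and `ablate_direction`, the shapes, and the random data are illustrative assumptions and are not taken from the work summarized above. A rank-1 update `outer(b, a)` can only move a layer's output along `b`, so projecting `b` out of the activations makes the adapted layer indistinguishable from the base layer.

```python
import numpy as np

def rank1_lora_delta(b, a):
    """Weight update of a rank-1 adapter: outer product of b (d_out,) and a (d_in,)."""
    return np.outer(b, a)

def ablate_direction(acts, direction):
    """Remove the component of each activation row that lies along `direction`."""
    unit = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ unit, unit)

# Toy dimensions and random data (assumptions for illustration only).
rng = np.random.default_rng(0)
d_in, d_out = 8, 6
W = rng.normal(size=(d_out, d_in))   # frozen base weight
b = rng.normal(size=d_out)           # adapter output direction
a = rng.normal(size=d_in)            # adapter input-reading vector
x = rng.normal(size=(4, d_in))       # batch of 4 inputs

base_out = x @ W.T
tuned_out = x @ (W + rank1_lora_delta(b, a)).T

# The adapter's whole effect is a per-input scalar times the fixed direction b.
shift = tuned_out - base_out
print(np.allclose(shift, np.outer(x @ a, b)))

# After ablating b, the adapted and base outputs coincide.
print(np.allclose(ablate_direction(tuned_out, b), ablate_direction(base_out, b)))
```

In a real model the direction would be read from the trained adapter or from residual-stream activations rather than sampled at random, and ablation would be applied at every layer where the direction is written.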
Further research is needed to directly compare the EM direction with the refusal direction in activation space, and to understand their similarity and relationship at the circuit level.
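One simple starting point for such a comparison is the cosine similarity between the two directions. The sketch below is an assumption-laden illustration: it estimates each direction as a difference of mean activations between two prompt sets (one common choice, not necessarily the method used in the work above), and it uses synthetic activations with a planted direction instead of real model data. The function names `mean_diff_direction` and `cosine_similarity` are hypothetical.

```python
import numpy as np

def mean_diff_direction(pos_acts, neg_acts):
    """Unit vector from the mean of `neg_acts` to the mean of `pos_acts`."""
    diff = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Synthetic stand-in for activations (assumption): noise plus a planted shift.
rng = np.random.default_rng(0)
d_model, n = 16, 200
planted = rng.normal(size=d_model)
planted /= np.linalg.norm(planted)

neg = 0.1 * rng.normal(size=(n, d_model))                  # baseline prompts
pos = 0.1 * rng.normal(size=(n, d_model)) + 3.0 * planted  # behavior-eliciting prompts

recovered = mean_diff_direction(pos, neg)
similarity = cosine_similarity(recovered, planted)
print(similarity > 0.99)  # the estimate aligns closely with the planted direction
```

With real activations, `recovered` would be computed once for refusal and once for EM at the same layer, and the two unit vectors compared with `cosine_similarity`. A high value would suggest shared structure, but it would not by itself establish a shared circuit.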
 
