Refusal Vector

Creator
Creator
Seonglae ChoSeonglae Cho
Created
Created
2025 Jan 26 18:19
Editor
Edited
Edited
2026 Feb 25 17:20
Refs
Refs

Refusal Feature

Refusal Vector Usages
 
 
 
 

Convergent Linear Representations of Emergent Misalignmen

Misalignment is also expressed as a linear direction in activation space like the
Refusal Vector
, so it can be interpreted through rank-1 LoRA adapters. Emergent Misalignment converges to a single linear direction in activation space. This result is similar to how the
Refusal Vector
is a single direction. Furthermore, using the direction extracted from one fine-tune, misalignment was suppressed even in completely different datasets and larger LoRA configurations. Using just a rank-1 LoRA adapter, they induced 11% EM while maintaining over 99% coherence.
Further research is needed to directly compare the EM direction vs. refusal direction in activation space to understand their similarity and relationships at the circuit level.
arxiv.org

LLMs Encode Harmfulness and Refusal Separately

The final token of user instructions (tinst) primarily encodes harmfulness, while the token after the system prompt (tpost-inst) mainly encodes whether to refuse
www.arxiv.org
Refusal is not a single feature in the output layer, but rather a structure where upstream (early/mid layers) 'harm detection representations' act as triggers that conditionally activate multiple downstream refusal circuits
hydra-effect-refusal
madhuri723Updated 2026 Feb 20 9:16
Decapitating the Hydra: How Upstream Sensors Control Refusal
The Unkillable Refusal I began by targeting Layers 14–16. This is the home of the downstream Refusal Features. I identified these using cosine similarity and expected a quick win. But the model held firm. It felt counter-intuitive. We often imagine refusal as a solid wall that grants full access once breached. But as Prakash et al. discovered in “Understanding Refusal in Language Models with Sparse Autoencoders”, refusal is a Hydra. Cut off one head, and dormant backup features immediately spike to take its place.
 
 

Recommendations