Semantics-Adaptive Dynamic Intervention
Generates a dynamic steering vector for each input, so the intervention reflects that input's semantics instead of using one fixed vector for all inputs
- Calculate activation differences between contrastive pairs (positive vs negative) → identify important components (attention heads, hidden states, neurons).
- During inference, scale the identified components element-wise by the input's own activations → the intervention direction adapts to the input's semantics.
In short, it is element-wise masking built from a contrastive dataset, as in CAA
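The two steps can be sketched in NumPy as below. This is a minimal illustration, not the paper's implementation: the function names `build_mask` and `intervene`, the use of the absolute mean difference to rank elements, and the `top_k` and `delta` parameters are all assumptions made for the example, and the activations are synthetic.

```python
import numpy as np

def build_mask(pos_acts, neg_acts, top_k):
    """Binary mask over the top_k elements with the largest mean contrastive difference.

    pos_acts, neg_acts: arrays of shape (num_pairs, dim).
    Ranking by absolute mean difference is an assumption of this sketch.
    """
    diff = (pos_acts - neg_acts).mean(axis=0)
    top_idx = np.argsort(np.abs(diff))[-top_k:]
    mask = np.zeros_like(diff)
    mask[top_idx] = 1.0
    return mask

def intervene(act, mask, delta):
    """Scale the input's own activation on the masked elements only.

    The added vector delta * (act * mask) depends on act, so it differs per input.
    """
    return act + delta * (act * mask)

# Synthetic contrastive pairs: only the first four dimensions differ.
rng = np.random.default_rng(0)
neg = rng.normal(size=(8, 16))
pos = neg.copy()
pos[:, :4] += 3.0

mask = build_mask(pos, neg, top_k=4)

# Hypothetical activation of a new input at inference time.
x = rng.normal(size=16)
steered = intervene(x, mask, delta=2.0)
```

Unlike a fixed steering vector, the added term here changes with `x`, which is what makes the intervention input-dependent; elements outside the mask are left untouched.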
Semantics-Adaptive Activation Intervention for LLMs via Dynamic...
Large language models (LLMs) have achieved remarkable performance across many tasks, yet aligning them with desired behaviors remains challenging. Activation intervention has emerged as an...
https://openreview.net/forum?id=8WQ7VTfPTl

Seonglae Cho