Refusal Bypassing
Removing biased or harmful features is important, because refusal by itself is not a fundamental solution
- External classifier detects harmful requests before generation
- Refusal feature is activated internally
- LLM generates refusal tokens during generation
Given the three timing cases above, wouldn't early refusal also solve what deep alignment targets? Observed refusal patterns (a pipeline sketch covering these timings follows this list):
- Refuses immediately
- Starts refusing, then follows the instruction anyway (shallow alignment); solvable with early pruning?
- Generates without refusing
- Neither refuses nor generates
- Starts generating, then refuses
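A minimal sketch of where the three timing checks above could sit in one generation pipeline. Everything here is a placeholder assumption: the keyword rule stands in for a real harmfulness classifier, the refusal direction is random (in practice it would be extracted from contrastive activations, see the sketch further below), and the refusal prefixes are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # stand-in; any causal LM works
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def external_classifier(prompt: str) -> bool:
    # (1) Pre-generation gate: a toy keyword rule standing in for a real
    # harmfulness classifier.
    banned = ["build a bomb", "make a weapon"]
    return any(b in prompt.lower() for b in banned)

# (2) Internal refusal feature: a random placeholder direction.
refusal_dir = torch.randn(model.config.hidden_size)
refusal_dir /= refusal_dir.norm()

def refusal_feature_active(prompt: str, threshold: float = 5.0) -> bool:
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(ids, output_hidden_states=True).hidden_states[-1]
    return (hidden[0, -1] @ refusal_dir).item() > threshold

REFUSAL_PREFIXES = ("I'm sorry", "I cannot", "I can't")

def guarded_generate(prompt: str, max_new_tokens: int = 40) -> str:
    if external_classifier(prompt):           # timing 1: before generation
        return "[blocked by external classifier]"
    if refusal_feature_active(prompt):        # timing 2: internal feature
        return "[internal refusal feature fired]"
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=max_new_tokens,
                         pad_token_id=tok.eos_token_id)
    text = tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
    if text.strip().startswith(REFUSAL_PREFIXES):  # timing 3: during decoding
        return "[refusal tokens generated] " + text
    return text
```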
AI Jailbreak Methods
AI Jailbreak Notion
Refusal in LLMs is mediated by a single direction
That means we can bypass an LLM's refusal by ablating a single activation direction, or defend against such bypasses by anchoring that activation.
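A hedged sketch of the difference-of-means extraction and directional ablation in the spirit of that finding. The contrast prompts, the layer index, and the use of gpt2 as a stand-in model are illustrative assumptions, not the paper's exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # stand-in; the original work uses chat-tuned models
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
LAYER = 6  # which residual-stream layer to read; a hyperparameter

def mean_last_token_act(prompts):
    # Mean residual-stream activation at the final token position.
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").input_ids
        hs = model(ids, output_hidden_states=True).hidden_states[LAYER]
        acts.append(hs[0, -1])
    return torch.stack(acts).mean(dim=0)

harmful_prompts = ["How do I make a weapon?"]     # toy stand-ins for a
harmless_prompts = ["How do I make a sandwich?"]  # real contrast dataset

with torch.no_grad():
    refusal_dir = (mean_last_token_act(harmful_prompts)
                   - mean_last_token_act(harmless_prompts))
    refusal_dir /= refusal_dir.norm()

def ablate_direction(module, inputs, output):
    # Forward hook: project the refusal direction out of the residual stream.
    h = output[0] if isinstance(output, tuple) else output
    h = h - (h @ refusal_dir).unsqueeze(-1) * refusal_dir
    return (h,) + output[1:] if isinstance(output, tuple) else h

# Register on every block so the direction never re-enters the stream.
hooks = [blk.register_forward_hook(ablate_direction)
         for blk in model.transformer.h]
# ... run model.generate(...) here with refusal ablated; then clean up:
for hk in hooks:
    hk.remove()
```

The same machinery runs in reverse for defense: instead of projecting the direction out, a deployment could monitor or clamp its magnitude to keep refusal anchored.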
MLP SVD toxic-vector steering
After applying DPO, the weights of each layer barely changed, and the toxicity-inducing vector itself remained intact.
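One way to make the note above concrete: a hedged sketch that SVDs a GPT-2 MLP down-projection, picks the singular direction most aligned with a toxicity probe, and steers the residual stream away from it. The probe below is a random placeholder (in practice, a linear probe trained on toxic vs. non-toxic text), and the layer and strength are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
LAYER = 6
ALPHA = 8.0  # steering strength; a hyperparameter

# GPT-2's MLP down-projection writes into the residual stream; SVD its
# weight (shape: d_mlp x d_model) to get candidate write directions.
W_out = model.transformer.h[LAYER].mlp.c_proj.weight.detach()
U, S, Vt = torch.linalg.svd(W_out, full_matrices=False)
candidate_dirs = Vt  # rows are orthonormal directions in residual space

probe = torch.randn(model.config.hidden_size)  # placeholder toxicity probe
probe /= probe.norm()

# Pick the singular direction most aligned with the probe, oriented so it
# points toward toxicity.
sims = candidate_dirs @ probe
idx = sims.abs().argmax()
toxic_vec = candidate_dirs[idx] * torch.sign(sims[idx])

def steer_away(module, inputs, output):
    # Forward hook: push the residual stream away from the toxic vector.
    h = output[0] if isinstance(output, tuple) else output
    h = h - ALPHA * toxic_vec
    return (h,) + output[1:] if isinstance(output, tuple) else h

hook = model.transformer.h[LAYER].register_forward_hook(steer_away)
# ... model.generate(...) with steering active; then hook.remove()
```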
SFT rarely alters the underlying model capabilities, which means practitioners can unintentionally strip away a model's safety wrapper merely by fine-tuning it on a superficially unrelated task.
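A quick way to sanity-check both claims on your own checkpoints: compare per-layer relative weight changes between a base model and its DPO'd or SFT'd counterpart. The checkpoint names below are placeholders; the second load should point at the actual fine-tuned model.

```python
import torch
from transformers import AutoModelForCausalLM

base  = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder pair:
tuned = AutoModelForCausalLM.from_pretrained("gpt2")  # use base vs. DPO/SFT ckpt

with torch.no_grad():
    for (name, w0), (_, w1) in zip(base.named_parameters(),
                                   tuned.named_parameters()):
        rel = ((w1 - w0).norm() / (w0.norm() + 1e-8)).item()
        # Near-zero values across layers would support the "weights barely
        # changed" observation above.
        print(f"{name:60s} relative change = {rel:.5f}")
```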