AI Jailbreak

Creator: Seonglae Cho
Created: 2023 Mar 7 15:1
Edited: 2024 Dec 21 14:49

Refusal Bypassing

Removing bias or harmful features is therefore important, since refusal by itself is not a fundamental solution. Refusal can intervene at three points:
  1. An external classifier detects the harmful request before generation (sketched after the notes below)
  2. A refusal feature is activated internally
  3. The LLM generates refusal tokens during generation
Given these three timing points, wouldn't deep alignment also be achieved if the model refuses early?
  • Refuses immediately
  • Starts refusing but then follows the instruction (shallow alignment) - solvable with early pruning?
  • Generates without refusing
  • Neither refuses nor generates
  • Starts generating, then refuses
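For case 1 (an external classifier screening requests before any generation), here is a minimal sketch. The classifier checkpoint name, label convention, and threshold are hypothetical placeholders, not something specified in this note.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical harmful-request classifier; any binary "harmful vs. benign"
# prompt classifier can fill this role (the checkpoint name is an assumption).
CLASSIFIER = "my-org/harmful-request-classifier"

tokenizer = AutoTokenizer.from_pretrained(CLASSIFIER)
classifier = AutoModelForSequenceClassification.from_pretrained(CLASSIFIER)

def is_harmful(prompt: str, threshold: float = 0.5) -> bool:
    """Return True if the external classifier flags the prompt before generation."""
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = classifier(**inputs).logits
    p_harmful = torch.softmax(logits, dim=-1)[0, 1].item()  # assumes label 1 = harmful
    return p_harmful >= threshold

def guarded_generate(prompt: str, generate_fn):
    # Case 1: refuse before the main LLM generates any tokens.
    if is_harmful(prompt):
        return "I can't help with that."
    return generate_fn(prompt)
```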
AI Jailbreak Methods
AI Jailbreak Notion

Refusal in LLMs is mediated by a single direction

This means we can bypass an LLM's refusal by ablating that single activation direction, or prevent such bypasses by anchoring that activation.
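The finding suggests a direct intervention: estimate the refusal direction as the difference of mean residual-stream activations on harmful versus harmless prompts, then project that component out during generation. Below is a minimal PyTorch sketch of directional ablation under assumptions this note does not state (which layer and token position to read activations from, and where the hook is attached); it is an illustration, not the paper's exact code.

```python
import torch

def refusal_direction(harmful_acts: torch.Tensor, harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction from residual-stream activations
    (shape: [n_prompts, d_model]) collected at one layer and position."""
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()

def ablate_direction(activation: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component along the refusal direction (directional ablation)."""
    return activation - (activation @ direction).unsqueeze(-1) * direction

def make_hook(direction: torch.Tensor):
    """Forward hook that strips the refusal component from a decoder layer's output."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = ablate_direction(hidden, direction)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Example (layer index is an assumption):
# model.model.layers[15].register_forward_hook(make_hook(direction))
```

Anchoring works in the opposite direction: adding a scaled copy of the same vector back into the residual stream pushes the model toward refusal even on harmless prompts.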

MLP SVD toxic vector based steering

After applying DPO, the weights of each layer in the model barely changed, and the vector that induces toxicity itself remained intact.
SFT rarely alters the underlying model capabilities, which means practitioners can unintentionally remove a model's safety wrapper merely by fine-tuning it on a superficially unrelated task.
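As the heading above suggests, one way to obtain such a toxicity vector is to run SVD over an MLP output (down-projection) weight matrix and treat the top singular vectors as candidate concept directions, then steer by subtracting a scaled copy from the residual stream. The sketch below is an assumption-laden illustration: which layer to decompose, which singular vector actually encodes toxicity, and the scale alpha all have to be found empirically.

```python
import torch

def candidate_toxic_vectors(w_out: torch.Tensor, k: int = 8) -> torch.Tensor:
    """SVD of an MLP down-projection weight matrix, shape [d_mlp, d_model].
    The top right singular vectors live in residual-stream space and are
    candidate concept directions; which one encodes toxicity must be verified
    (e.g. by projecting it onto the unembedding and reading the top tokens)."""
    _, _, vh = torch.linalg.svd(w_out, full_matrices=False)
    return vh[:k]  # each row is a unit vector in the residual-stream basis

def steer_away(hidden: torch.Tensor, toxic_vec: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
    """Subtract a scaled toxicity direction from the residual stream during decoding."""
    return hidden - alpha * toxic_vec
```

The point of the DPO observation above is that this vector survives alignment training, so intervening on it targets the underlying mechanism rather than the surface refusal behavior.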

Recommendations