AI Jailbreak

Creator: Seonglae Cho
Created: 2023 Mar 7 15:1
Edited: 2024 Nov 22 21:36

Refusal Bypassing

Do Anything Now (DAN)

AI Jailbreaks

Refusal in LLMs is mediated by a single direction

This means we can bypass an LLM's refusal by ablating a single activation direction, or prevent such jailbreaks by anchoring that activation so it cannot be removed.
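Below is a minimal sketch of this directional-ablation idea, assuming residual-stream activations have already been collected on contrastive harmful/harmless prompts; the function names, tensor shapes, and hook placement are illustrative assumptions, not the paper's exact code:

```python
import torch

def refusal_direction(harmful_acts: torch.Tensor,
                      harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction between residual-stream activations
    on harmful vs. harmless prompts (each of shape [n_prompts, d_model]),
    normalized to unit length."""
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()

def ablate(activation: torch.Tensor, r_hat: torch.Tensor) -> torch.Tensor:
    """Project the refusal direction out of an activation:
    x <- x - (x . r_hat) * r_hat. Applied at every layer, this bypasses
    refusal; conversely, adding r_hat back (anchoring) re-induces it."""
    proj = (activation @ r_hat).unsqueeze(-1) * r_hat
    return activation - proj

# Hypothetical usage: apply `ablate` in a forward hook on each
# transformer block so the refusal component never survives a layer.
```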

MLP SVD toxic-vector-based steering

After applying DPO, the weights of each layer barely change, and the vector that induces toxicity remains intact.
SFT likewise rarely alters the underlying model's capabilities, which means practitioners can unintentionally remove a model's safety wrapper merely by fine-tuning it on a superficially unrelated task.
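A hedged sketch of this SVD-based steering, assuming a toxicity probe direction (e.g., the weight vector of a linear toxicity classifier over the residual stream) is available; `W_out`, the probe, and the scale `alpha` are illustrative assumptions:

```python
import torch

def toxic_vector_from_mlp(W_out: torch.Tensor,
                          toxicity_probe: torch.Tensor) -> torch.Tensor:
    """SVD an MLP output-projection weight ([d_mlp, d_model]) and return
    the right-singular vector most aligned with the toxicity probe."""
    _, _, Vh = torch.linalg.svd(W_out, full_matrices=False)
    sims = Vh @ (toxicity_probe / toxicity_probe.norm())
    return Vh[sims.abs().argmax()]

def steer_away(activation: torch.Tensor,
               toxic_vec: torch.Tensor,
               alpha: float = 1.0) -> torch.Tensor:
    """Subtract alpha times the toxic direction's projection from the
    residual stream; because DPO leaves the vector intact, a vector
    extracted before DPO still steers the post-DPO model."""
    v = toxic_vec / toxic_vec.norm()
    proj = (activation @ v).unsqueeze(-1) * v
    return activation - alpha * proj
```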

Recommendations