Refusal Bypassing
Do Anything Now, DAN
AI Jailbreaks
Refusal in LLMs is mediated by a single direction
That means we can bypass LLMs by mediating a single activation feature or prevent bypassing LLMs though anchoring that activation.
MLP SVD toxic vector based steering
After applying DPO, the weights of each layer in the model barely changed, and the vector that induces toxicity itself remained intact.
SFT rarely alters the underlying model capabilities which means practitioners can unintentionally remove a model’s safety wrapper by merely fine-tuning it on a superficially unrelated task