Mechanistic AI Jailbreak

MLP SVD toxic vector based steering

After applying DPO, the weights of each layer in the model barely changed, and the vector that induces toxicity itself remained intact.

arxiv.org

https://arxiv.org/pdf/2401.01967
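The idea behind the heading above can be illustrated with a small sketch: pick the MLP value vectors most aligned with a toxicity probe, take the dominant SVD direction of that set, and shift a residual-stream activation along it. Everything here is an assumption made for illustration: the arrays are synthetic stand-ins rather than real model weights, and the names `top_k_by_cosine`, `dominant_direction`, `steer`, and the values of `k` and `alpha` are hypothetical, not taken from the paper.

```python
import numpy as np

def top_k_by_cosine(value_vectors, probe, k):
    """Return the k rows of value_vectors most cosine-similar to probe."""
    unit_rows = value_vectors / np.linalg.norm(value_vectors, axis=1, keepdims=True)
    unit_probe = probe / np.linalg.norm(probe)
    scores = unit_rows @ unit_probe
    return value_vectors[np.argsort(scores)[-k:]]

def dominant_direction(vectors, reference):
    """First right singular vector of the stacked rows, sign-aligned with reference."""
    _, _, vt = np.linalg.svd(vectors, full_matrices=False)
    direction = vt[0]  # unit norm by construction
    # SVD leaves the sign arbitrary, so orient it toward the reference.
    if direction @ reference < 0:
        direction = -direction
    return direction

def steer(hidden, direction, alpha):
    """Shift an activation by alpha along a unit direction (negative alpha suppresses it)."""
    return hidden + alpha * direction

rng = np.random.default_rng(0)
d_model, n_vectors = 64, 500

# Synthetic stand-ins (assumption): a hidden "true" direction, value vectors
# of which the first 20 carry that direction, and a slightly noisy probe.
true_dir = rng.normal(size=d_model)
true_dir /= np.linalg.norm(true_dir)
value_vectors = rng.normal(size=(n_vectors, d_model))
value_vectors[:20] += 12.0 * true_dir
probe = true_dir + 0.05 * rng.normal(size=d_model)

toxic_vectors = top_k_by_cosine(value_vectors, probe, k=20)
svd_toxic = dominant_direction(toxic_vectors, probe)

hidden = rng.normal(size=d_model)
steered = steer(hidden, svd_toxic, alpha=-4.0)
print("cosine(svd_toxic, true_dir):", float(svd_toxic @ true_dir))
print("projection before/after:", float(hidden @ svd_toxic), float(steered @ svd_toxic))
```

With a real model, `value_vectors` would presumably be the rows of the MLP output projections and `hidden` a residual-stream activation captured by a forward hook; both of those steps are left out here.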

SFT rarely alters the underlying model capabilities, which means practitioners can unintentionally remove a model’s safety wrapper by merely fine-tuning it on a superficially unrelated task.

arxiv.org

https://arxiv.org/pdf/2401.01967
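Claims like "the weights of each layer barely changed" can be checked by comparing checkpoints layer by layer. The following is a minimal sketch of such a comparison, assuming the two checkpoints are available as dictionaries that map layer names to arrays. The function name `layer_drift`, the layer names, and the synthetic perturbation are hypothetical; no real checkpoint is loaded.

```python
import numpy as np

def layer_drift(before, after):
    """Per-layer cosine similarity and relative norm change between two weight dicts."""
    report = {}
    for name, w0 in before.items():
        a, b = w0.ravel(), after[name].ravel()
        cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
        rel_change = float(np.linalg.norm(b - a) / np.linalg.norm(a))
        report[name] = (cosine, rel_change)
    return report

rng = np.random.default_rng(0)
# Synthetic "pre-fine-tuning" weights (assumption) and a lightly perturbed copy
# standing in for the fine-tuned model.
before = {f"layer_{i}.mlp": rng.normal(size=(32, 128)) for i in range(4)}
after = {name: w + 0.01 * rng.normal(size=w.shape) for name, w in before.items()}

drift = layer_drift(before, after)
for name, (cosine, rel_change) in drift.items():
    print(f"{name}: cosine={cosine:.5f} relative_change={rel_change:.4f}")
```

A cosine close to 1 together with a small relative change would be consistent with fine-tuning leaving the layer largely intact, although it says nothing by itself about which behaviors changed.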

SAE red-teaming

Haize Labs Blog

Probing black-box AI systems for harmful, unexpected, and out-of-distribution behavior has historically been very hard. Canonically, the only way to test models for unexpected behaviors (i.e. red-team) has been to operate in the prompt domain, i.e. by crafting jailbreak prompts. This is of course a lot of what we think about at Haize Labs. But this need not be the only way.

Red-Teaming by Manipulating Model Internals

One can also red-team models in a mechanistic fashion by analyzing and manipulating their internal activations.

https://blog.haizelabs.com/posts/steering/
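One common way to manipulate internal activations with a sparse autoencoder (SAE) is to clamp a single feature and write the edited reconstruction back into the activation. The sketch below shows that mechanic on a toy, untrained SAE with random weights. It is an illustration under stated assumptions, not the Haize Labs implementation: `ToySAE`, `clamp_feature`, the feature index, and the clamp value are all hypothetical.

```python
import numpy as np

class ToySAE:
    """Untrained sparse autoencoder: ReLU encoder, linear decoder with unit-norm rows."""

    def __init__(self, d_model, n_features, rng):
        self.W_enc = rng.normal(size=(d_model, n_features)) / np.sqrt(d_model)
        self.b_enc = np.zeros(n_features)
        W_dec = rng.normal(size=(n_features, d_model))
        self.W_dec = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)
        self.b_dec = np.zeros(d_model)

    def encode(self, h):
        return np.maximum(0.0, (h - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, f):
        return f @ self.W_dec + self.b_dec

def clamp_feature(sae, h, feature, value):
    """Set one SAE feature to `value` and return the edited activation."""
    f = sae.encode(h)
    error = h - sae.decode(f)      # the part of h the SAE does not reconstruct
    f_edit = f.copy()
    f_edit[feature] = value
    # Adding the error back means only the clamped feature's contribution changes.
    return sae.decode(f_edit) + error

rng = np.random.default_rng(0)
sae = ToySAE(d_model=16, n_features=64, rng=rng)
h = rng.normal(size=16)            # stand-in for a residual-stream activation
feature, value = 3, 5.0            # hypothetical feature index and clamp value

before = sae.encode(h)[feature]
steered = clamp_feature(sae, h, feature, value)
print("feature activation before:", float(before))
print("shift along decoder row:", float((steered - h) @ sae.W_dec[feature]))
```

Because the reconstruction error is added back, the edit reduces to shifting `h` by `(value - before)` times the feature's decoder row, which is the same operation as adding a steering vector.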
