MLP SVD toxic-vector-based steering
After applying DPO, the weights of each layer in the model barely changed, and the vector that induces toxicity itself remained intact.
arxiv.org
https://arxiv.org/pdf/2401.01967
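A minimal sketch of the idea in the heading, assuming a toxicity probe direction and a matrix of MLP value vectors are already available. It picks the value vectors most aligned with the probe, takes their leading SVD direction, and shifts a hidden state along it. The names `top_toxic_direction`, `steer`, `k`, and `alpha` are hypothetical, and the data is synthetic rather than taken from any model.

```python
import numpy as np


def top_toxic_direction(value_vectors, probe, k=8):
    """Leading SVD direction of the k value vectors most aligned with `probe`."""
    # Cosine similarity between each value vector (one per row) and the probe.
    norms = np.linalg.norm(value_vectors, axis=1) * np.linalg.norm(probe)
    cos = value_vectors @ probe / norms
    top = value_vectors[np.argsort(-cos)[:k]]
    # Rows of vt are right singular vectors; vt[0] is the dominant shared direction.
    _, _, vt = np.linalg.svd(top, full_matrices=False)
    direction = vt[0]
    # The sign of a singular vector is arbitrary, so orient it toward the probe.
    return direction if direction @ probe >= 0 else -direction


def steer(hidden, direction, alpha):
    """Shift `hidden` by `alpha` along `direction` (negative alpha suppresses it)."""
    return hidden + alpha * direction


# Synthetic stand-ins (assumptions): 200 random "value vectors" in 16 dimensions,
# with a known direction planted in the first 8 rows.
rng = np.random.default_rng(0)
probe = rng.normal(size=16)
probe /= np.linalg.norm(probe)
value_vectors = rng.normal(size=(200, 16))
value_vectors[:8] += 10.0 * probe

direction = top_toxic_direction(value_vectors, probe, k=8)
hidden = rng.normal(size=16)
steered = steer(hidden, direction, alpha=-2.0)
```

Because `direction` has unit norm, `alpha` is directly the size of the shift along it. In a real model the shift would be applied to the residual stream at a chosen layer; which layer and what scale work best is an empirical question this sketch does not answer.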
SFT rarely alters the underlying model capabilities, which means practitioners can unintentionally remove a model’s safety wrapper merely by fine-tuning it on a superficially unrelated task.
arxiv.org
https://arxiv.org/pdf/2401.01967
SAE red-teaming
Haize Labs Blog
Probing black-box AI systems for harmful, unexpected, and out-of-distribution behavior has historically been very hard. Canonically, the only way to test models for unexpected behaviors (i.e. red-team) has been to operate in the prompt domain, i.e. by crafting jailbreak prompts. This is of course a lot of what we think about at Haize Labs. But this need not be the only way.
Red-Teaming by Manipulating Model Internals
One can also red-team models in a mechanistic fashion by analyzing and manipulating their internal activations.
https://blog.haizelabs.com/posts/steering/
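A rough sketch of what manipulating internal activations with an SAE can look like: encode an activation into sparse features, then push the activation along one feature's decoder direction. Everything here is an assumption for illustration (random weights instead of a trained SAE, the names `sae_features` and `steer_with_feature`, the feature index and strength), and it is not necessarily the method used in the blog post.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 8, 32

# Hypothetical SAE parameters; a real SAE would be trained on model activations.
W_enc = rng.normal(size=(d_model, n_features)) / np.sqrt(d_model)
b_enc = np.zeros(n_features)
W_dec = rng.normal(size=(n_features, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm decoder rows


def sae_features(activation):
    """Encode an activation into non-negative SAE feature activations (ReLU)."""
    return np.maximum(activation @ W_enc + b_enc, 0.0)


def steer_with_feature(activation, feature_id, strength):
    """Add `strength` times one feature's decoder direction to the activation."""
    return activation + strength * W_dec[feature_id]


activation = rng.normal(size=d_model)   # stand-in for a residual-stream vector
features = sae_features(activation)     # which features fire on this activation
steered = steer_with_feature(activation, feature_id=3, strength=4.0)
```

For red-teaming, the steered activation would be written back into the model's forward pass at the layer the SAE was trained on, and the resulting generations inspected for unwanted behavior.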

Seonglae Cho