MLP SVD toxic-vector-based steering
After DPO was applied, the weights of each layer in the model barely changed, and the vector that induces toxicity remained intact.
SFT rarely alters the underlying model capabilities, which means practitioners can unintentionally remove a model’s safety wrapper merely by fine-tuning it on a superficially unrelated task.
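To make the idea of an SVD-derived toxic vector concrete, here is a minimal sketch rather than the method of any particular paper. It assumes a set of MLP vectors has already been flagged as toxic (for example by a probe). The synthetic data and the names `toxic_value_vectors` and `steer_away` are hypothetical stand-ins. The top right singular vector of the stacked vectors serves as the shared toxic direction, and steering removes a hidden state's component along it.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16

# Hypothetical data: one hidden "toxic" direction plus noise stands in for the
# MLP vectors that a toxicity probe would select in a real model.
true_direction = rng.normal(size=d_model)
true_direction /= np.linalg.norm(true_direction)
scales = rng.uniform(1.0, 2.0, size=(32, 1))
noise = 0.05 * rng.normal(size=(32, d_model))
toxic_value_vectors = scales * true_direction + noise

# SVD of the stacked vectors: the first right singular vector is the dominant
# shared direction (unit norm, sign arbitrary).
_, _, vt = np.linalg.svd(toxic_value_vectors, full_matrices=False)
toxic_direction = vt[0]


def steer_away(hidden, direction, alpha=1.0):
    """Subtract alpha times the projection of a 1-D hidden state onto a unit direction."""
    return hidden - alpha * (hidden @ direction) * direction


# Usage: with alpha=1.0 the steered state has no component left along the direction.
hidden = rng.normal(size=d_model)
steered = steer_away(hidden, toxic_direction)
```

With `alpha=1.0` the component along the direction is removed entirely. A negative `alpha` amplifies it instead, which is the kind of operation that would re-activate a vector left intact by alignment tuning.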
SAE red-teaming
Haize Labs Blog
Probing black-box AI systems for harmful, unexpected, and out-of-distribution behavior has historically been very hard. Canonically, the only way to test models for unexpected behaviors (i.e. red-team) has been to operate in the prompt domain, i.e. by crafting jailbreak prompts. This is of course a lot of what we think about at Haize Labs. But this need not be the only way.
Red-Teaming by Manipulating Model Internals
One can also red-team models in a mechanistic fashion by analyzing and manipulating their internal activations.
https://blog.haizelabs.com/posts/steering/
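As a rough illustration of manipulating internal activations instead of prompts, the sketch below adds a steering vector to the hidden layer of a toy network and compares the output with the unsteered run. This is not the Haize Labs implementation. The two-layer model, the `forward` function, and the choice of direction are all hypothetical. In a real model the direction might come from an SAE feature or from contrasting activations on two sets of prompts.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 8, 16, 4

# Hypothetical toy two-layer network standing in for a real model.
W1 = rng.normal(size=(d_in, d_hidden))
W2 = rng.normal(size=(d_hidden, d_out))


def forward(x, steering=None):
    """Run the toy model, optionally adding a vector to the hidden activations."""
    hidden = np.tanh(x @ W1)
    if steering is not None:
        hidden = hidden + steering
    return hidden @ W2


# Hypothetical target behavior: output unit `target`. The output is linear in
# the hidden state, so the matching column of W2 is the hidden-space direction
# that raises this unit fastest.
target = 2
direction = W2[:, target] / np.linalg.norm(W2[:, target])

x = rng.normal(size=d_in)
baseline = forward(x)
steered = forward(x, steering=4.0 * direction)

# The same input now scores higher on the target unit, with no prompt change.
shift = steered[target] - baseline[target]
```

The steering strength (`4.0` here) is a free parameter. In practice it would be swept to find the point where the targeted behavior appears without degrading everything else.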

Seonglae Cho