CAA similarly uses the mean activation difference, but experimentally searches for the best injection layers, positions, and injection coefficients
Contrastive Activation
Unlike Neuron SAE, it directly targets the difference in model activations between two scenarios. This is useful for capturing binary or large behavioral shifts, but it makes precise control harder and may affect several related behaviors at once.
GPT2-XL (48 layers)
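A minimal sketch of this mean-difference steering, assuming a Hugging Face GPT-2 XL checkpoint; the prompts, layer index, and coefficient below are illustrative placeholders (in practice CAA sweeps layers, positions, and coefficients experimentally), not the paper's settings:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl")
model.eval()

LAYER = 6    # block whose output is steered (chosen by sweeping in practice)
COEFF = 8.0  # injection coefficient (also swept experimentally)

def block_output(prompt: str) -> torch.Tensor:
    """Residual-stream activation after block LAYER, at the last prompt token."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    return out.hidden_states[LAYER + 1][0, -1]  # hidden_states[0] is the embedding output

# Illustrative contrastive prompt pairs (positive vs. negative behavior).
pos_prompts = ["I love talking about weddings", "Weddings make me happy"]
neg_prompts = ["I hate talking about weddings", "Weddings make me miserable"]

pos_mean = torch.stack([block_output(p) for p in pos_prompts]).mean(0)
neg_mean = torch.stack([block_output(p) for p in neg_prompts]).mean(0)
steering_vec = pos_mean - neg_mean  # mean activation difference

def inject(module, inputs, output):
    # GPT-2 blocks return a tuple; the hidden states are the first element.
    hidden = output[0] + COEFF * steering_vec.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(inject)
ids = tok("My friend invited me to a party and", return_tensors="pt").input_ids
gen = model.generate(ids, max_new_tokens=30, do_sample=False)
handle.remove()
print(tok.decode(gen[0], skip_special_tokens=True))
```

Flipping the sign of the coefficient steers away from the behavior; the sweep over layers and coefficients is what the note above refers to.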
Limitation
The effect of a steering vector (SV) varies significantly across inputs within a dataset, and in some cases it produces the opposite of the intended behavior. SV generalization across datasets is also limited: it works well only between settings that exhibit similar behaviors, and it degrades further when steering the model towards behaviors it has not previously performed.
SAE-based feature pruning
Prune features that activate on both contrastive prompts
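A minimal sketch of that pruning step, assuming a pretrained sparse autoencoder exposing `encode`/`decode` methods; the API, threshold, and function name are illustrative assumptions, not a specific library's interface:

```python
import torch

def prune_shared_features(sae, pos_act: torch.Tensor, neg_act: torch.Tensor,
                          threshold: float = 0.0) -> torch.Tensor:
    """Build a steering direction with features active on both prompts pruned.

    `sae.encode` / `sae.decode` are assumed to map between the residual stream
    and a sparse feature basis; any SAE implementation with that shape works.
    """
    f_pos = sae.encode(pos_act)  # (n_features,) feature activations on the positive prompt
    f_neg = sae.encode(neg_act)  # (n_features,) feature activations on the negative prompt

    # Features firing on BOTH contrastive prompts are treated as shared context
    # (topic, formatting, etc.) rather than the targeted behavior, so drop them.
    shared = (f_pos > threshold) & (f_neg > threshold)

    f_diff = f_pos - f_neg
    f_diff[shared] = 0.0

    # Decode the remaining, behavior-specific features back to activation space.
    return sae.decode(f_diff)
```

The idea is that features firing on only one side of the contrast carry the behavior-specific signal, so the pruned direction should steer more precisely than the raw activation difference.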