Contrastive activation additions
Compute the average activation difference between contrastive prompt pairs and add it to all token positions, as in simple ActAdd
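
A minimal sketch of the idea above, using plain Python lists in place of real model activations. The function names (`steering_vector`, `add_to_all_tokens`), the `scale` multiplier, and the toy numbers are illustrative assumptions, not taken from the paper's code; in practice the vectors would be residual-stream activations read from one layer of the model.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(pos_acts: List[Vector], neg_acts: List[Vector]) -> Vector:
    """Average of (positive - negative) activation differences over paired examples."""
    diffs = [
        [p - q for p, q in zip(pos, neg)]
        for pos, neg in zip(pos_acts, neg_acts)
    ]
    return mean_vector(diffs)


def add_to_all_tokens(
    token_acts: List[Vector], vec: Vector, scale: float = 1.0
) -> List[Vector]:
    """Add scale * vec to the activation at every token position."""
    return [[a + scale * v for a, v in zip(tok, vec)] for tok in token_acts]


# Toy data (hypothetical): two contrastive pairs with 2-dimensional activations.
pos = [[1.0, 2.0], [3.0, 4.0]]
neg = [[0.0, 1.0], [1.0, 1.0]]
vec = steering_vector(pos, neg)

# Activations for a two-token prompt, steered with multiplier 2.0.
steered = add_to_all_tokens([[0.0, 0.0], [1.0, 1.0]], vec, scale=2.0)
```

A negative `scale` subtracts the vector instead, which steers away from the behavior rather than toward it.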

Steering Llama 2 via Contrastive Activation Addition
Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, Alexander Turner. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024.
https://aclanthology.org/2024.acl-long.828/
LLaMa 2 CAA
CAA (nrimsky)
Steering Llama-2 with contrastive activation additions — LessWrong
The effects of subtracting or adding a "sycophancy vector" to one bias term. TL;DR: By just adding e.g. a "sycophancy vector" to one bias term, we o…
https://www.lesswrong.com/posts/v7f8ayBxLhmMFRzpa/steering-llama-2-with-contrastive-activation-additions
arxiv.org
https://arxiv.org/pdf/2312.06681
STA
Filtering by frequency makes sense, but filtering by magnitude does not.
arxiv.org
https://arxiv.org/pdf/2505.20322

Seonglae Cho