Contrastive activation additions
Add the mean activation difference between contrastive prompt pairs to every token position, a simple extension of ActAdd
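
The idea can be sketched in a few lines. This is a minimal, dependency-free illustration, not the paper's implementation: the function names `steering_vector` and `add_to_all_tokens` and the `coeff` parameter are hypothetical, and activations are plain lists of floats standing in for residual-stream activations at one layer.

```python
from typing import List

Vector = List[float]


def steering_vector(pos_acts: List[Vector], neg_acts: List[Vector]) -> Vector:
    """Mean of (positive - negative) activation differences over paired prompts.

    pos_acts[i] and neg_acts[i] are assumed to be activations from the same
    layer for the i-th contrastive pair.
    """
    if not pos_acts or len(pos_acts) != len(neg_acts):
        raise ValueError("need the same non-zero number of positive and negative activations")
    n = len(pos_acts)
    dim = len(pos_acts[0])
    return [
        sum(p[i] - q[i] for p, q in zip(pos_acts, neg_acts)) / n
        for i in range(dim)
    ]


def add_to_all_tokens(hidden: List[Vector], vec: Vector, coeff: float = 1.0) -> List[Vector]:
    """Add coeff * vec to the hidden state at every token position.

    A negative coeff subtracts the vector, steering away from the behavior.
    """
    return [[h + coeff * v for h, v in zip(token, vec)] for token in hidden]


# Toy example: two contrastive pairs with 2-dimensional activations.
pos = [[1.0, 2.0], [3.0, 4.0]]
neg = [[0.0, 1.0], [1.0, 1.0]]
vec = steering_vector(pos, neg)

# Two token positions, steered with coefficient 2.
steered = add_to_all_tokens([[0.0, 0.0], [1.0, 1.0]], vec, coeff=2.0)
```

In a real model the addition would happen inside a forward hook on the chosen layer. The paper linked below applies the vector at token positions after the user's prompt, with either a positive or a negative coefficient.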

Steering Llama 2 via Contrastive Activation Addition
Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, Alexander Turner. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024.
https://aclanthology.org/2024.acl-long.828/
CAA (nrimsky): Llama 2 CAA
Steering Llama-2 with contrastive activation additions — LessWrong
The effects of subtracting or adding a "sycophancy vector" to one bias term. TL;DR: By just adding e.g. a "sycophancy vector" to one bias term, we o…
https://www.lesswrong.com/posts/v7f8ayBxLhmMFRzpa/steering-llama-2-with-contrastive-activation-additions
STA
Filtering by frequency makes sense, but filtering by magnitude does not.

Seonglae Cho