Contrastive activation additions

Add the average activation difference to all tokens using simple ActAdd.

Steering Llama 2 via Contrastive Activation Addition
Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, Alexander Turner. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024.
https://aclanthology.org/2024.acl-long.828/

LLaMa 2 CAA
Code repository by nrimsky (updated 2025 Aug 19).

Steering Llama-2 with contrastive activation additions (LessWrong)
The effects of subtracting or adding a "sycophancy vector" to one bias term. TL;DR: By just adding e.g. a "sycophancy vector" to one bias term, we o…
https://www.lesswrong.com/posts/v7f8ayBxLhmMFRzpa/steering-llama-2-with-contrastive-activation-additions

arXiv PDF
https://arxiv.org/pdf/2312.06681
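
The one-line description above (average the activation difference, then add it to every token) can be illustrated with a toy sketch. This is not the authors' code: plain Python lists stand in for model activations, and the names `steering_vector`, `add_to_all_tokens`, and `multiplier` are hypothetical. It also does not model which layer the activations come from or how they are read out of a real model.

```python
from statistics import fmean


def steering_vector(pos_acts, neg_acts):
    """Average the per-pair difference (positive - negative), dimension by dimension.

    pos_acts / neg_acts: lists of equal-length activation vectors, one per
    contrast pair (toy stand-ins for activations taken from a model).
    """
    dim = len(pos_acts[0])
    return [
        fmean(p[i] - n[i] for p, n in zip(pos_acts, neg_acts))
        for i in range(dim)
    ]


def add_to_all_tokens(token_acts, vector, multiplier=1.0):
    """Add multiplier * vector to the activation at every token position."""
    return [
        [a + multiplier * v for a, v in zip(tok, vector)]
        for tok in token_acts
    ]


# Toy contrast pairs: two pairs of 2-dimensional "activations".
pos = [[1.0, 2.0], [3.0, 0.0]]
neg = [[0.0, 1.0], [1.0, 2.0]]
vec = steering_vector(pos, neg)

# Toy sequence of three token activations.
tokens = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
steered = add_to_all_tokens(tokens, vec, multiplier=2.0)      # push toward the behaviour
suppressed = add_to_all_tokens(tokens, vec, multiplier=-1.0)  # push away from it

print(vec)
print(steered)
```

A positive multiplier adds the vector and a negative one subtracts it, which mirrors the "subtracting or adding" framing in the LessWrong summary. In a real model the same addition would typically be applied inside a forward hook on one layer, which this sketch leaves out.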