CAA

Creator

Seonglae Cho

Created

2024 Oct 26 15:57

Editor

Seonglae Cho

Edited

2025 Nov 16 23:43

Refs

Contrastive activation additions

Add the average activation difference between contrastive (positive and negative) examples to the residual stream at all token positions.
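The operation above can be sketched in a few lines. This is a minimal illustration on toy lists, not the paper's implementation: the function names `caa_steering_vector` and `apply_steering`, the `coeff` parameter name, and the toy activation values are all assumptions made for this example. In practice the activations would be read from one layer of the model and the vector added back during the forward pass.

```python
from typing import List

Matrix = List[List[float]]  # rows are examples or token positions, columns are hidden dims


def caa_steering_vector(pos_acts: Matrix, neg_acts: Matrix) -> List[float]:
    """Average the per-pair difference (positive minus negative) over all pairs.

    Hypothetical helper: each row is the activation of one example at a fixed layer.
    """
    if not pos_acts or len(pos_acts) != len(neg_acts):
        raise ValueError("need the same non-zero number of positive and negative examples")
    n_pairs = len(pos_acts)
    d_model = len(pos_acts[0])
    return [
        sum(p[i] - q[i] for p, q in zip(pos_acts, neg_acts)) / n_pairs
        for i in range(d_model)
    ]


def apply_steering(hidden: Matrix, vector: List[float], coeff: float = 1.0) -> Matrix:
    """Add coeff * vector to the hidden state at every token position."""
    return [[h + coeff * v for h, v in zip(row, vector)] for row in hidden]


# Toy data (assumed values): 2 contrastive pairs, hidden size 3.
pos_acts = [[1.0, 2.0, 0.0], [3.0, 2.0, 1.0]]
neg_acts = [[0.0, 2.0, 1.0], [1.0, 0.0, 1.0]]
vector = caa_steering_vector(pos_acts, neg_acts)

# Hidden states for a 4-token sequence, steered with a positive coefficient.
hidden = [[0.0, 0.0, 0.0] for _ in range(4)]
steered = apply_steering(hidden, vector, coeff=2.0)
```

A negative `coeff` subtracts the same vector, which corresponds to steering away from the behavior rather than toward it.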

Steering Llama 2 via Contrastive Activation Addition

Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, Alexander Turner. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024.

https://aclanthology.org/2024.acl-long.828/

CAA (Llama 2), nrimsky

Steering Llama-2 with contrastive activation additions — LessWrong

The effects of subtracting or adding a "sycophancy vector" to one bias term. TL;DR: By just adding e.g. a "sycophancy vector" to one bias term, we o…

https://www.lesswrong.com/posts/v7f8ayBxLhmMFRzpa/steering-llama-2-with-contrastive-activation-additions

https://arxiv.org/pdf/2312.06681

STA

Filtering by frequency makes sense, but filtering by magnitude does not.

https://arxiv.org/pdf/2505.20322
