
CAA

Creator: Seonglae Cho
Created: 2024 Oct 26 15:57
Editor: Seonglae Cho
Edited: 2025 Aug 19 10:34
Refs: steering-vectors • Updated 2025 Jan 3 16:28

Contrastive activation additions

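CAA computes a steering vector as the average difference in residual-stream activations between paired positive and negative examples of a behavior. At inference time it adds that vector, scaled by a positive or negative coefficient, to every token position after the prompt. It builds on the simpler ActAdd, which derives its vector from a single prompt pair.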
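The two steps described here (average the activation differences, then add the result to the token activations) can be sketched in plain Python. This is only a toy illustration under stated assumptions: activations are small lists of floats rather than a real model's residual stream, and the names `mean_difference`, `add_steering`, `coeff`, and `start` are hypothetical, not taken from the CAA codebase.

```python
from typing import List, Sequence

Vector = List[float]


def mean_difference(pos_acts: Sequence[Vector], neg_acts: Sequence[Vector]) -> Vector:
    """Average (positive - negative) activation over paired examples."""
    if not pos_acts or len(pos_acts) != len(neg_acts):
        raise ValueError("need equally many, non-empty positive and negative activations")
    dim = len(pos_acts[0])
    total = [0.0] * dim
    for p, n in zip(pos_acts, neg_acts):
        for i in range(dim):
            total[i] += p[i] - n[i]
    return [t / len(pos_acts) for t in total]


def add_steering(hidden: Sequence[Vector], vector: Vector,
                 coeff: float, start: int = 0) -> List[Vector]:
    """Add coeff * vector to every token position from `start` onward.

    `start` is a hypothetical parameter standing in for "first position
    after the prompt"; earlier positions are copied unchanged.
    """
    return [
        [h + coeff * v for h, v in zip(tok, vector)] if idx >= start else list(tok)
        for idx, tok in enumerate(hidden)
    ]


# Toy data: two contrastive pairs of 2-dimensional activations (assumed values).
pos_acts = [[1.0, 2.0], [3.0, 2.0]]
neg_acts = [[0.0, 1.0], [1.0, 3.0]]
steer = mean_difference(pos_acts, neg_acts)

# Three token positions; steer positions 1 and 2 only.
hidden = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
steered = add_steering(hidden, steer, coeff=2.0, start=1)
```

A negative `coeff` would subtract the behavior direction instead of adding it. In a real setting the same addition would be applied inside a chosen transformer layer, for example through a forward hook, which this sketch does not attempt to show.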
Steering Llama 2 via Contrastive Activation Addition
Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, Alexander Turner. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024.
https://aclanthology.org/2024.acl-long.828/

LLaMa 2
CAA
nrimsky • Updated 2025 Aug 19 9:20

Steering Llama-2 with contrastive activation additions — LessWrong
The effects of subtracting or adding a "sycophancy vector" to one bias term. TL;DR: By just adding e.g. a "sycophancy vector" to one bias term, we o…
https://www.lesswrong.com/posts/v7f8ayBxLhmMFRzpa/steering-llama-2-with-contrastive-activation-additions
arxiv.org
https://arxiv.org/pdf/2312.06681

Backlinks

Activation Patching, ActAdd, Steering Vectors BiDPO, FGAA, Pixels Versus Priors

Copyright Seonglae Cho