
CAA

Creator: Seonglae Cho
Created: 2024 Oct 26 15:57
Editor: Seonglae Cho
Edited: 2025 Feb 17 13:4
Contrastive activation additions

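CAA extends simple ActAdd: it averages the activation difference over many contrastive (positive vs. negative) prompt pairs to get a steering vector, then adds that vector, scaled by a positive or negative coefficient, to the activations at all token positions after the prompt.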
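
A minimal sketch of that computation, assuming the activations have already been extracted from one layer and are stored as plain lists of floats. The function names `steering_vector` and `add_steering`, the `coeff` and `start` parameters, and the toy numbers are illustrative assumptions, not taken from the CAA repository or paper.

```python
from statistics import fmean


def steering_vector(pos_acts, neg_acts):
    """Mean of (positive - negative) activation differences over paired examples."""
    if len(pos_acts) != len(neg_acts) or not pos_acts:
        raise ValueError("need the same non-zero number of positive and negative activations")
    dim = len(pos_acts[0])
    return [
        fmean([p[i] - n[i] for p, n in zip(pos_acts, neg_acts)])
        for i in range(dim)
    ]


def add_steering(hidden, vector, coeff=1.0, start=0):
    """Return a copy of `hidden` with coeff * vector added at every position >= start."""
    return [
        [h + coeff * v for h, v in zip(row, vector)] if t >= start else list(row)
        for t, row in enumerate(hidden)
    ]


# Toy activations: two contrastive pairs, hidden size 2 (hypothetical values).
pos = [[1.0, 2.0], [3.0, 2.0]]
neg = [[0.0, 1.0], [1.0, 3.0]]
vec = steering_vector(pos, neg)

# Three token positions; position 0 stands in for the prompt and is left untouched.
hidden = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
steered = add_steering(hidden, vec, coeff=2.0, start=1)
print(vec, steered)
```

A negative `coeff` subtracts the vector instead, which steers away from the behavior. In a real model the addition would happen inside the forward pass at a chosen layer rather than on a stored list.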

Llama 2
CAA
nrimsky • Updated 2025 Apr 16 12:17

Steering Llama-2 with contrastive activation additions — LessWrong
The effects of subtracting or adding a "sycophancy vector" to one bias term.  TL;DR: By just adding e.g. a "sycophancy vector" to one bias term, we o…
https://www.lesswrong.com/posts/v7f8ayBxLhmMFRzpa/steering-llama-2-with-contrastive-activation-additions
arxiv.org
https://arxiv.org/pdf/2312.06681

Copyright Seonglae Cho