ActAdd

Creator
Seonglae Cho
Created
2024 Oct 26 15:59
Edited
2025 Mar 10 15:59
Refs
CAA
Similarly uses the mean difference of activations, but experimentally finds better injection layers, positions, and injection coefficients
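
The mean-difference idea can be sketched as below. This is a minimal illustration, not the CAA authors' code: plain Python lists stand in for activation tensors captured at one layer, and the name `mean_difference` is hypothetical.

```python
def mean_difference(pos_acts, neg_acts):
    """Average (positive - negative) activation over paired examples.

    pos_acts / neg_acts: equal-length lists of activation vectors taken at the
    same layer and position (assumed to be captured elsewhere).
    """
    if not pos_acts or len(pos_acts) != len(neg_acts):
        raise ValueError("need the same non-zero number of positive and negative examples")
    dim = len(pos_acts[0])
    total = [0.0] * dim
    for pos, neg in zip(pos_acts, neg_acts):
        for i in range(dim):
            total[i] += pos[i] - neg[i]
    # Divide by the number of pairs to get the mean difference vector.
    return [t / len(pos_acts) for t in total]


# Toy data: two contrastive pairs with 2-dimensional activations.
pos = [[1.0, 0.0], [3.0, 2.0]]
neg = [[0.0, 0.0], [1.0, 1.0]]
steer = mean_difference(pos, neg)
```

Averaging over many pairs is meant to cancel prompt-specific noise that a single pair would carry into the vector.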

Contrastive Activation

Unlike Neuron SAE, it directly targets the difference in model activations between two scenarios. This is useful for capturing binary or large behavioral changes, but it makes precise control more difficult and may affect multiple related behaviors simultaneously.
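
A minimal sketch of that contrast, assuming the activations for the two scenarios have already been captured at the same layer. Plain lists stand in for residual-stream vectors, and the helper names and the coefficient value are illustrative assumptions rather than anything taken from the original method's code.

```python
def steering_vector(act_a, act_b):
    """Difference between activations of two contrasting prompts (same layer)."""
    if len(act_a) != len(act_b):
        raise ValueError("activations must have the same dimension")
    return [a - b for a, b in zip(act_a, act_b)]


def add_steering(residual, vector, coeff):
    """Add the scaled steering vector to one residual-stream position."""
    if len(residual) != len(vector):
        raise ValueError("residual and steering vector must have the same dimension")
    return [r + coeff * v for r, v in zip(residual, vector)]


# Toy 3-dimensional activations for the two scenarios.
act_a = [1.0, 2.0, 0.5]
act_b = [0.5, 2.0, -0.5]
vec = steering_vector(act_a, act_b)                      # [0.5, 0.0, 1.0]
steered = add_steering([0.0, 1.0, 1.0], vec, coeff=2.0)  # [1.0, 1.0, 3.0]
```

Every dimension where the two scenarios differ gets shifted, which is why a single vector can move several related behaviors at once.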
 
 
 

GPT2-XL (48 layers)

Limitation

The effect of a steering vector (SV) varies significantly across inputs within a dataset, and in some cases it produces the opposite of the desired effect. Generalization across datasets is also limited: an SV works well only between settings that exhibit similar behaviors, and generalization degrades further when trying to guide the model toward behaviors it has not previously performed.

SAE based feature pruning

Prune features that activate on both contrastive prompts
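
One way to sketch this pruning step, assuming each prompt has already been encoded into SAE feature activations. The activation threshold, the function name, and the choice to work with a feature-space difference are assumptions made for illustration; the note only states that shared features are pruned.

```python
def prune_shared_features(feats_pos, feats_neg, threshold=0.0):
    """Feature-space difference with features active on BOTH prompts zeroed out."""
    if len(feats_pos) != len(feats_neg):
        raise ValueError("feature vectors must have the same length")
    pruned = []
    for p, n in zip(feats_pos, feats_neg):
        active_on_both = p > threshold and n > threshold
        # Shared features carry no contrast, so drop them; keep the rest.
        pruned.append(0.0 if active_on_both else p - n)
    return pruned


# Toy SAE activations: feature 1 fires on both prompts and is pruned.
feats_pos = [0.9, 0.4, 0.0, 0.0]
feats_neg = [0.0, 0.5, 0.7, 0.0]
pruned = prune_shared_features(feats_pos, feats_neg)
```

The remaining nonzero entries are the features that distinguish the two prompts; mapping them back to activation space would use the SAE decoder, which is not shown here.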
 
 
