CAA similarly uses the mean activation difference, but experimentally searches for the best injection layers, positions, and injection coefficients
Contrastive Activation
Unlike Neuron SAE, it directly targets the difference in model activations between two scenarios. This is useful for capturing binary or large behavioral shifts, but it makes precise control harder and may affect several related behaviors at once.
GPT2-XL (48 layers)
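A minimal sketch of this mean-difference steering, assuming a Hugging Face GPT-2 XL checkpoint; the prompts, layer index, and coefficient below are illustrative placeholders (in practice CAA sweeps layers, positions, and coefficients experimentally), not the paper's settings:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl")
model.eval()

LAYER = 6    # block whose output is steered (chosen by sweeping in practice)
COEFF = 8.0  # injection coefficient (also swept experimentally)

def block_output(prompt: str) -> torch.Tensor:
    """Residual-stream activation after block LAYER, at the last prompt token."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    return out.hidden_states[LAYER + 1][0, -1]  # hidden_states[0] is the embedding output

# Illustrative contrastive prompt pairs (positive vs. negative behavior).
pos_prompts = ["I love talking about weddings", "Weddings make me happy"]
neg_prompts = ["I hate talking about weddings", "Weddings make me miserable"]

pos_mean = torch.stack([block_output(p) for p in pos_prompts]).mean(0)
neg_mean = torch.stack([block_output(p) for p in neg_prompts]).mean(0)
steering_vec = pos_mean - neg_mean  # mean activation difference

def inject(module, inputs, output):
    # GPT-2 blocks return a tuple; the hidden states are the first element.
    hidden = output[0] + COEFF * steering_vec.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(inject)
ids = tok("My friend invited me to a party and", return_tensors="pt").input_ids
gen = model.generate(ids, max_new_tokens=30, do_sample=False)
handle.remove()
print(tok.decode(gen[0], skip_special_tokens=True))
```

Flipping the sign of the coefficient steers away from the behavior; the sweep over layers and coefficients is what the note above refers to.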
Limitation
The effect of a steering vector (SV) varies significantly across inputs within a dataset, and in some cases it produces the opposite of the intended behavior. SV generalization across datasets is also limited: it works well only between settings that exhibit similar behaviors, and it degrades further when steering the model towards behaviors it has not previously performed.
SAE-based feature pruning
Prune features that activate on both contrastive prompts
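A minimal sketch of that pruning step, assuming a pretrained sparse autoencoder exposing `encode`/`decode` methods; the API, threshold, and function name are illustrative assumptions, not a specific library's interface:

```python
import torch

def prune_shared_features(sae, pos_act: torch.Tensor, neg_act: torch.Tensor,
                          threshold: float = 0.0) -> torch.Tensor:
    """Build a steering direction with features active on both prompts pruned.

    `sae.encode` / `sae.decode` are assumed to map between the residual stream
    and a sparse feature basis; any SAE implementation with that shape works.
    """
    f_pos = sae.encode(pos_act)  # (n_features,) feature activations on the positive prompt
    f_neg = sae.encode(neg_act)  # (n_features,) feature activations on the negative prompt

    # Features firing on BOTH contrastive prompts are treated as shared context
    # (topic, formatting, etc.) rather than the targeted behavior, so drop them.
    shared = (f_pos > threshold) & (f_neg > threshold)

    f_diff = f_pos - f_neg
    f_diff[shared] = 0.0

    # Decode the remaining, behavior-specific features back to activation space.
    return sae.decode(f_diff)
```

The idea is that features firing on only one side of the contrast carry the behavior-specific signal, so the pruned direction should steer more precisely than the raw activation difference.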