Activation Engineering

Created
2024 Jul 7 15:52
Editor
Creator
Seonglae Cho
Edited
2024 Nov 27 17:24
Unlike Prompt Engineering, which persuades LLMs without clear evidence of why it works, activation engineering steers LLMs by directly manipulating their activations.
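One common way to manipulate activations is activation addition: a steering vector is added to the hidden states at some layer. The sketch below is a toy illustration using NumPy arrays in place of a real model. The names `steering_vector`, `steer`, and `alpha`, the difference-of-means construction, and the random toy data are all assumptions made for illustration, not something this note prescribes.

```python
import numpy as np

def steering_vector(pos_acts, neg_acts):
    """Difference of mean activations between two contrasting prompt sets."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def steer(hidden, vector, alpha=1.0):
    """Add a scaled steering vector to the activation at every token position."""
    return hidden + alpha * vector

rng = np.random.default_rng(0)
d_model = 8  # hypothetical hidden size

# Toy stand-ins for activations collected on two contrasting sets of prompts.
pos_acts = rng.normal(size=(16, d_model)) + 1.0
neg_acts = rng.normal(size=(16, d_model)) - 1.0

v = steering_vector(pos_acts, neg_acts)

# Toy hidden states for a 5-token prompt, shape (tokens, d_model).
hidden = rng.normal(size=(5, d_model))
steered = steer(hidden, v, alpha=2.0)
```

In a real model the same addition would typically be applied inside a forward hook at the chosen layer, and `alpha` would be tuned by hand.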
When investigating activations, Compounding Error matters more as the prompt gets longer.
Covariance Matrix and Correlation Matrix can be used for analysis, since the Linear representation hypothesis supports treating features as linear directions in activation space.
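As a rough sketch of this kind of analysis, the covariance and correlation matrices of a batch of activations can be computed with NumPy. The activations here are random toy data (an assumption), with one dimension deliberately made to track another so that the correlation is visible. The eigendecomposition at the end is one possible follow-up step, not something this note specifies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy activations: 200 samples of a 4-dimensional hidden state.
acts = rng.normal(size=(200, 4))
# Make dimension 1 mostly follow dimension 0 so the two are strongly correlated.
acts[:, 1] = 0.9 * acts[:, 0] + 0.1 * acts[:, 1]

# rowvar=False: each column is a variable (activation dimension), each row a sample.
cov = np.cov(acts, rowvar=False)        # shape (4, 4)
corr = np.corrcoef(acts, rowvar=False)  # shape (4, 4), unit diagonal

# eigh returns eigenvalues in ascending order, so the last eigenvector
# is the direction of largest variance in activation space.
eigvals, eigvecs = np.linalg.eigh(cov)
top_direction = eigvecs[:, -1]
```

Under the linear representation hypothesis, high-variance or strongly correlated directions like `top_direction` are natural candidates to inspect as features.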
Activation Engineering Notion
Activation Engineering Tools
Activation Engineering Platforms

Attribution Dictionary Learning

Ordinary dictionary learning only considers activations. It ignores gradients and weights.
More fundamentally, it seems like features have a dual nature. Looking backwards towards the input, they are "representations". Looking forwards towards the output, they are "actions". Both of these should be sparse – that is, they should sparsely represent the activations produced by the input, and also sparsely affect the gradients influencing the output.
The proposed method aims to minimize the unexplained contribution to the output by including an attribution term in the SAE loss function while maintaining sparsity.
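To make the idea concrete, here is a minimal sketch of what such a loss could look like for a single activation vector. Everything in it is an assumption made for illustration: the attribution of a feature is taken to be its activation times the output gradient projected onto its decoder direction, the SAE has no decoder bias, and the function name, penalty weights, and toy data are hypothetical. It is not the exact loss from the original proposal.

```python
import numpy as np

def attribution_sae_loss(x, grad, W_enc, b_enc, W_dec,
                         l1_act=1e-3, l1_attr=1e-3, l_unexp=1.0):
    """Toy SAE loss with attribution terms (assumed form, see lead-in).

    x:     activation vector, shape (d,)
    grad:  gradient of a scalar output with respect to x, shape (d,)
    W_enc: encoder weights, shape (n, d); b_enc: encoder bias, shape (n,)
    W_dec: decoder weights, shape (d, n)
    """
    f = np.maximum(W_enc @ x + b_enc, 0.0)  # feature activations (ReLU)
    x_hat = W_dec @ f                       # reconstruction
    residual = x - x_hat

    # Linear attribution of each feature to the output: activation x gradient.
    attr = f * (grad @ W_dec)
    # Contribution to the output that the dictionary fails to explain.
    unexplained = grad @ residual

    loss = (residual @ residual               # reconstruction error
            + l1_act * np.abs(f).sum()        # sparse representations
            + l1_attr * np.abs(attr).sum()    # sparse actions
            + l_unexp * abs(unexplained))     # unexplained contribution
    return loss, attr, unexplained

rng = np.random.default_rng(0)
d, n = 6, 12  # hypothetical activation size and dictionary size
x = rng.normal(size=d)
grad = rng.normal(size=d)
W_enc = rng.normal(size=(n, d)) * 0.3
b_enc = np.zeros(n)
W_dec = rng.normal(size=(d, n)) * 0.3

loss, attr, unexplained = attribution_sae_loss(x, grad, W_enc, b_enc, W_dec)
```

With this definition the feature attributions and the unexplained term sum to `grad @ x`, the first-order contribution of the whole activation, so shrinking the unexplained term pushes that contribution onto the sparse features.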

Open problems

 
 
 
