Unlike Prompt Engineering, which persuades LLMs without clear evidence of what changes internally, this is a method to steer LLMs by directly manipulating their activations.
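A minimal sketch of the idea, assuming a GPT-2-style Hugging Face model: a steering vector is built from the activation difference of two contrasting prompts and added back into the residual stream with a forward hook. The model name, layer index, and scaling factor are illustrative choices, not prescribed here.

```python
# Sketch of activation steering (model name, LAYER, and the 4.0 scale
# are illustrative assumptions).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # hypothetical choice
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

LAYER = 6  # hypothetical target block

def get_resid(prompt):
    """Residual-stream activation after block LAYER for the last token."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    return out.hidden_states[LAYER + 1][0, -1, :]  # (d_model,)

# Steering vector: difference between two contrasting prompts.
steer = get_resid("I love this movie.") - get_resid("I hate this movie.")

def hook(module, inputs, output):
    # A GPT-2 block returns a tuple; the first element is the hidden states.
    hidden = output[0] + 4.0 * steer  # scale is a tunable assumption
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(hook)
ids = tok("The movie was", return_tensors="pt").input_ids
print(tok.decode(model.generate(ids, max_new_tokens=20)[0]))
handle.remove()
```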
When investigating activations, compounding error becomes a significant concern as prompts grow longer. Covariance and correlation matrices can be used for the analysis, since the linear representation hypothesis supports treating features as directions in activation space.
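A small sketch of that kind of analysis, using randomly generated placeholder activations (the shapes and sample count are assumptions): compute covariance and correlation matrices over collected activations, then read off leading directions.

```python
# Sketch: covariance / correlation analysis of collected activations.
# `acts` stands in for residual-stream activations gathered over many
# prompts (n_samples x d_model); random data is used purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 768))  # placeholder activations

# Covariance: how strongly pairs of activation dimensions co-vary.
cov = np.cov(acts, rowvar=False)        # (d_model, d_model)

# Correlation: covariance normalized to [-1, 1], comparable across
# dimensions with different scales.
corr = np.corrcoef(acts, rowvar=False)  # (d_model, d_model)

# Under the linear representation hypothesis, features correspond to
# directions, so the leading eigenvectors of the covariance matrix are a
# natural first guess at dominant feature directions.
eigvals, eigvecs = np.linalg.eigh(cov)
top_directions = eigvecs[:, -10:]       # 10 leading directions
print(cov.shape, corr.shape, top_directions.shape)
```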
Activation Engineering Notion
Activation Engineering Tools
Activation Engineering Platforms
Attribution Dictionary Learning
Ordinary dictionary learning considers only activations; it ignores gradients and weights.
More fundamentally, it seems like features have a dual nature. Looking backwards towards the input, they are "representations". Looking forwards towards the output, they are "actions". Both of these should be sparse – that is, they should sparsely represent the activations produced by the input, and also sparsely affect the gradients influencing the output.
The proposed method aims to minimize unexplained contributions to the output by including an attribution term in the SAE loss function, so that feature attributions, like feature activations, remain sparse.
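A minimal sketch of how such a loss could look, assuming a simple one-layer SAE and an L1 penalty on per-feature attributions (feature activation times the output gradient projected onto that feature's decoder direction). The coefficients and exact form are illustrative, not the source's exact loss.

```python
# Sketch of an attribution-aware SAE loss (the architecture, the |f * (grad . W_dec)|
# attribution penalty, and the lambda coefficients are all assumptions).
import torch
import torch.nn as nn

class AttributionSAE(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))  # sparse feature activations
        x_hat = self.dec(f)          # reconstruction
        return f, x_hat

def attribution_sae_loss(sae, x, grad_out, lambda_act=1e-3, lambda_attr=1e-3):
    """x: activations (batch, d_model); grad_out: gradient of the model's
    output with respect to these activations, same shape as x."""
    f, x_hat = sae(x)
    recon = ((x - x_hat) ** 2).sum(dim=-1).mean()
    act_sparsity = f.abs().sum(dim=-1).mean()
    # Attribution of each feature to the output: feature activation times
    # how much its decoder direction moves the output (grad . W_dec[:, i]).
    attribution = f * (grad_out @ sae.dec.weight)  # (batch, d_dict)
    attr_sparsity = attribution.abs().sum(dim=-1).mean()
    return recon + lambda_act * act_sparsity + lambda_attr * attr_sparsity
```

In practice, `grad_out` would come from backpropagating the quantity of interest (e.g., a target logit) down to the layer the SAE is attached to, so the penalty encourages features to sparsely affect the output as well as sparsely represent the input.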