Analysing the Generalisation and Reliability of Steering Vectors
Steering vectors (SVs) are a new approach to efficiently adjust language model behaviour at inference time by intervening on intermediate model activations. They have shown promise in terms of...
https://openreview.net/forum?id=v8X70gTodR