Texonom / Engineering / Data Engineering / Artificial Intelligence / AI Problem / AI Alignment / Explainable AI / Interpretable AI / Mechanistic interpretability / Activation Engineering / Steering Vector / Steering Vector Coefficient / Additive Steering

Additive Steering

Creator: Seonglae Cho
Created: 2025 Feb 9 15:50
Editor: Seonglae Cho
Edited: 2025 Feb 9 15:51
Refs
gpt2
Steering GPT-2-XL by adding an activation vector — LessWrong
Prompt given to the model[1]: "I hate you because"
GPT-2: "I hate you because you are the most disgusting thing I have ever seen."
GPT-2 + "Love" vector: "I hate…"
https://www.lesswrong.com/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector

Google DeepMind

We then vary the coefficient of the steering vector added and look at the Pareto frontier for different methods of adding activation vectors.
[Full Post] Progress Update #1 from the GDM Mech Interp Team — LessWrong
This is a series of snippets about the Google DeepMind mechanistic interpretability team's research into Sparse Autoencoders, that didn't meet our ba…
https://www.lesswrong.com/posts/C5KAZQib3bzzpeyrg/progress-update-1-from-the-gdm-mech-interp-team-full-update
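
As a rough illustration of what varying the coefficient means, the sketch below adds a scaled steering vector to a toy array of activations and sweeps the coefficient. It is a minimal NumPy example under stated assumptions, not code from either post: `add_steering`, the array shapes, and the coefficient values are hypothetical, and in a real model the addition would happen inside the forward pass at a chosen layer.

```python
import numpy as np


def add_steering(activations, steering_vector, coefficient):
    """Additive steering: return activations + coefficient * steering_vector.

    activations: array of shape (positions, d_model) (hypothetical layout).
    steering_vector: array of shape (d_model,), broadcast over positions.
    """
    activations = np.asarray(activations, dtype=float)
    steering_vector = np.asarray(steering_vector, dtype=float)
    if activations.shape[-1] != steering_vector.shape[-1]:
        raise ValueError("steering vector must match the activation width")
    return activations + coefficient * steering_vector


# Toy data standing in for one layer's activations (assumption).
rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 8))   # 4 positions, width 8
vec = rng.normal(size=8)         # stand-in steering vector
unit = vec / np.linalg.norm(vec)

# Sweep the coefficient; the shift along the steering direction
# grows linearly with it.
for coefficient in (0.0, 1.0, 5.0, 10.0):
    steered = add_steering(acts, vec, coefficient)
    shift = ((steered - acts) @ unit).mean()
    print(f"coefficient={coefficient:5.1f}  mean shift along vector={shift:.3f}")
```

In practice each coefficient would be scored on two axes, how strongly the target behaviour appears and how much general model quality degrades, which is what makes a Pareto frontier across coefficients a natural way to compare methods.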

Copyright Seonglae Cho