SAE Steering Vector Hybrid

Creator: Seonglae Cho
Created: 2025 Feb 12 23:3
Editor: Seonglae Cho
Edited: 2025 Mar 10 16:9
  • Top-down: build the steering vector by averaging activation differences between contrasting prompts
  • Bottom-up: extract the relevant SAE features and amplify their activations
  • Hybrid: combine both approaches in a single steering vector
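
The three approaches above can be sketched on toy data. This is a minimal illustration in plain Python, not the method from the linked post: the function names, the toy activations, the identity-like decoder, and in particular the convex mixing weight `alpha` used for the hybrid are all assumptions made for the example.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def top_down_vector(pos_acts: List[Vector], neg_acts: List[Vector]) -> Vector:
    """Top-down: difference of the mean activations of two prompt sets."""
    pos_mean = mean_vector(pos_acts)
    neg_mean = mean_vector(neg_acts)
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def bottom_up_vector(decoder_rows: List[Vector], feature_ids: List[int],
                     scale: float) -> Vector:
    """Bottom-up: sum the decoder directions of chosen SAE features, amplified by scale."""
    out = [0.0] * len(decoder_rows[0])
    for idx in feature_ids:
        for j, value in enumerate(decoder_rows[idx]):
            out[j] += scale * value
    return out


def hybrid_vector(top_down: Vector, bottom_up: Vector, alpha: float = 0.5) -> Vector:
    """Hybrid (assumed form): weighted mix of the top-down and bottom-up vectors."""
    return [alpha * t + (1.0 - alpha) * b for t, b in zip(top_down, bottom_up)]


def steer(activation: Vector, vector: Vector, strength: float = 1.0) -> Vector:
    """Add the steering vector to an activation at the chosen layer."""
    return [a + strength * v for a, v in zip(activation, vector)]


# Toy activations (hypothetical values) for two contrasting prompt sets.
pos_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
neg_acts = [[0.0, 1.0, 2.0], [0.0, 1.0, 2.0]]
# Toy SAE decoder: one row per feature, one column per activation dimension.
decoder = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

td = top_down_vector(pos_acts, neg_acts)
bu = bottom_up_vector(decoder, feature_ids=[0, 2], scale=2.0)
hy = hybrid_vector(td, bu, alpha=0.5)
steered = steer([0.0, 0.0, 0.0], hy, strength=1.0)
```

In a real model the activations would come from a chosen layer's residual stream and the decoder rows from a trained SAE. How the two vectors are weighted against each other is a design choice that this sketch does not settle.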
 
 
 
 
Comparing the effectiveness of top-down and bottom-up activation steering for bypassing refusal on harmful prompts (LessWrong)
TL;DR This project compares the effectiveness of top-down and bottom-up activation steering methods in controlling refusal behaviour. In line with pr…
https://www.lesswrong.com/posts/boB3hJiZijxM3J6Ed/comparing-the-effectiveness-of-top-down-and-bottom-up

Copyright Seonglae Cho