
Cross-Domain Effects by Feature Steering

Creator
Seonglae Cho
Created
2024 Dec 1 12:2
Editor
Seonglae Cho
Edited
2024 Dec 1 12:9
Due to internal correlations within the model, steering one feature also affects other features. This indicates a limitation of the Neuron SAE: it did not strictly achieve Monosemanticity.
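The cross-feature effect can be illustrated with a minimal toy sketch. Everything in it is an assumption made for illustration and is not taken from the referenced research: the three feature directions, the 2-D activation space, the names `FEATURES`, `read_features`, `steer`, and `steering_side_effects`, and the simplification that each feature is read out as a ReLU of the projection onto its own direction (a real SAE has a separately learned encoder and biases).

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

# Hypothetical toy dictionary: three unit-norm feature directions in a 2-D
# activation space. Three directions cannot all be mutually orthogonal in
# 2-D, so some overlap between features is unavoidable.
FEATURES = {
    "feature_a": normalize([1.0, 0.0]),
    "feature_b": normalize([1.0, 1.0]),  # overlaps with feature_a
    "feature_c": normalize([0.0, 1.0]),  # orthogonal to feature_a
}

def read_features(activation):
    """Read each feature as a ReLU of the projection onto its direction."""
    return {name: max(0.0, dot(activation, d)) for name, d in FEATURES.items()}

def steer(activation, feature_name, strength):
    """Add `strength` times the chosen feature direction to the activation."""
    direction = FEATURES[feature_name]
    return [a + strength * x for a, x in zip(activation, direction)]

def steering_side_effects(activation, feature_name, strength):
    """Return the change in every feature readout caused by steering one feature."""
    before = read_features(activation)
    after = read_features(steer(activation, feature_name, strength))
    return {name: after[name] - before[name] for name in FEATURES}

base = [0.5, 0.5]
effects = steering_side_effects(base, "feature_a", 2.0)
for name, delta in effects.items():
    print(f"{name}: {delta:+.3f}")
```

In this toy setup, steering `feature_a` by 2.0 also raises `feature_b` by roughly 1.41 (the steering strength times the cosine between the two directions), while the orthogonal `feature_c` is unchanged. Off-target effects of this kind are what one would expect whenever feature directions are not fully independent.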
 
 
 
 
Anthropic research by Durmus et al.: "Evaluating feature steering: A case study in mitigating social biases"
https://www.anthropic.com/research/evaluating-feature-steering
 

Copyright Seonglae Cho