Because features are correlated inside the model, steering one feature also affects other features. This points to a limitation of SAEs: the learned features are not strictly monosemantic.
A new piece of Anthropic research by Durmus et al.: "Evaluating feature steering: A case study in mitigating social biases"
https://www.anthropic.com/research/evaluating-feature-steering
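
The side effect noted above can be illustrated with a toy sketch. This is not the setup used in the cited research: the three decoder directions, the tied encoder, and the helper names `encode` and `steer` are all hypothetical, and geometric overlap between directions stands in for the model's internal correlations. It only shows that adding one feature's direction to an activation also raises the activation of any feature whose direction is not orthogonal to it.

```python
import numpy as np

# Hypothetical toy SAE: 3 features in a 2-D activation space, so the
# decoder directions cannot all be mutually orthogonal.
decoder = np.array([
    [1.0, 0.0],  # feature 0: the feature being steered
    [0.6, 0.8],  # feature 1: overlaps with feature 0
    [0.0, 1.0],  # feature 2: orthogonal to feature 0
])
encoder = decoder.T  # tied weights, assumed here for simplicity


def encode(x):
    """Feature activations: ReLU of the projection onto each direction."""
    return np.maximum(x @ encoder, 0.0)


def steer(x, feature, strength):
    """Add a multiple of one feature's decoder direction to the activation."""
    return x + strength * decoder[feature]


x = np.array([0.5, 0.5])
before = encode(x)
after = encode(steer(x, feature=0, strength=2.0))
delta = after - before  # change in every feature caused by steering feature 0

print("before:", before)
print("after: ", after)
print("delta: ", delta)
```

In this toy case feature 1 moves along with feature 0 while feature 2 is unchanged, which is the kind of off-target effect that makes it hard to treat a single feature as an isolated control knob.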


Seonglae Cho