Cross-Domain Effects by Feature Steering

Creator

Seonglae Cho

Created

2024 Dec 1 12:2

Editor

Seonglae Cho

Edited

2024 Dec 1 12:9

Refs

Because features inside the model are correlated, steering one feature also affects other features, including features in other domains. This points to a limitation of the Sparse Autoencoder: its features did not strictly achieve Monosemanticity.
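The spillover can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the cited research: the three decoder directions, the tied-weight ReLU encoder, the helper names `encode` and `steer`, and the steering strength `alpha`. It models correlation as overlap between decoder directions, which is only one possible mechanism, and shows that adding a multiple of one feature's direction to an activation also raises any feature whose direction overlaps with it.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

# Hypothetical decoder directions in a 3-dimensional activation space.
d_target = unit(np.array([1.0, 0.0, 0.0]))     # feature being steered
d_related = unit(np.array([0.6, 0.8, 0.0]))    # overlaps with the target (cosine 0.6)
d_unrelated = unit(np.array([0.0, 0.0, 1.0]))  # orthogonal to the target
W_dec = np.stack([d_target, d_related, d_unrelated])

def encode(x):
    # Toy tied-weight encoder (an assumption): ReLU of the projection
    # of x onto each decoder direction.
    return np.maximum(W_dec @ x, 0.0)

def steer(x, feature_idx, alpha):
    # Steering: add alpha times the chosen feature's decoder direction.
    return x + alpha * W_dec[feature_idx]

x = np.array([0.5, 0.5, 0.5])
before = encode(x)
after = encode(steer(x, feature_idx=0, alpha=2.0))
delta = after - before  # approximately [2.0, 1.2, 0.0]
```

In this toy setup the related feature moves by `alpha` times the cosine between the two directions (2.0 × 0.6 = 1.2), while the orthogonal feature does not move. Strictly monosemantic, mutually independent features would behave like the orthogonal case.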

A new piece of Anthropic research by Durmus et al.: "Evaluating feature steering: A case study in mitigating social biases"

https://www.anthropic.com/research/evaluating-feature-steering
