SAE-TS

Creator

Seonglae Cho

Created

2025 Feb 9 16:5

Editor

Seonglae Cho

Edited

2025 Mar 10 16:9

Refs

Collect data on how a decoder steering vector changes the encoded SAE feature activations.

Train a linear predictor that takes a decoder steering vector as input and outputs the resulting change in the feature vector.

Use the predictor to construct an optimized steering vector for the target feature.
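The three steps above can be sketched in NumPy. This is only a minimal illustration on synthetic data, not the paper's implementation: the "collected" effects are generated from a hypothetical linear map (`true_M`, `true_b`) instead of real model runs, and the names `targeted_steering_vector`, `M`, and `b` are assumptions introduced here. Picking the normalized column of `M` for the target feature is one simple choice; the paper's exact construction may differ (for example by also correcting for side effects on other features).

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features, n_samples = 16, 32, 500

# Hypothetical ground-truth effect map, standing in for real model measurements.
true_M = rng.normal(size=(d_model, n_features))
true_b = rng.normal(size=n_features)

# Step 1: collect (steering vector, change in SAE feature activations) pairs.
X = rng.normal(size=(n_samples, d_model))
Y = X @ true_M + true_b + 0.01 * rng.normal(size=(n_samples, n_features))

# Step 2: fit the linear predictor Y ~ X @ M + b by least squares.
X_aug = np.hstack([X, np.ones((n_samples, 1))])  # extra column of ones for the bias
W, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)
M, b = W[:-1], W[-1]

# Step 3: choose a steering direction for the target feature.
def targeted_steering_vector(M, target):
    """Unit-norm direction with the largest predicted effect on `target`."""
    v = M[:, target]
    return v / np.linalg.norm(v)

target = 3
s = targeted_steering_vector(M, target)
predicted_effect = s @ M  # predicted change per feature, ignoring the bias
```

Under this linear model, `predicted_effect[target]` equals the norm of the target column of `M`, which is the largest effect any unit-norm steering vector can have on that feature.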

Activation values can also be used as steering vectors. There are two ways to obtain a steering vector from SAE features: simple decoding, which uses the feature's decoder direction, and SAE-TS. In both cases, the steering coefficient plays an important role. To understand features and steer LLMs efficiently, it is crucial to understand which factors affect SAE feature activation and what patterns they follow.
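For contrast, simple decoding can be sketched as adding a scaled decoder row to the residual activations. This is a hedged illustration: `W_dec` is a random stand-in for a trained SAE decoder, and `steer` and the coefficient value `8.0` are hypothetical choices made here, not values from the source.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 16, 32

# Random stand-in for an SAE decoder matrix, with unit-norm rows (assumption).
W_dec = rng.normal(size=(n_features, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)

def steer(resid, direction, coefficient):
    """Add `coefficient * direction` to every residual activation vector."""
    return resid + coefficient * direction

resid = rng.normal(size=(4, d_model))  # activations at 4 token positions
feature = 5
direction = W_dec[feature]             # simple decoding: use the decoder row
steered = steer(resid, direction, coefficient=8.0)

# How far each position moved along the steering direction.
shift = (steered - resid) @ direction
```

Because the direction has unit norm, the coefficient alone sets how far the activations move, which is why its value matters for both simple decoding and SAE-TS.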

arxiv.org

https://arxiv.org/pdf/2411.02193
