- Collect data on how steering with a decoder vector changes the encoded SAE features.
- Train a linear predictor that takes a decoder steering vector as input and outputs the resulting difference in the feature vector.
- Use the predictor to compute an optimized steering vector for the target feature.

Activation values can also be used as steering vectors. There are two ways to obtain steering vectors from SAE features: simple decoding and SAE-TS. In both cases, coefficients play an important role. To understand features and steer LLMs efficiently, it is crucial to understand which factors affect SAE feature activation and what patterns they follow.
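
As a minimal sketch of simple decoding, the steering vector can be taken as one row of the SAE decoder matrix, scaled by a coefficient and added to the hidden states. The names `W_dec`, `decode_steering_vector`, and `steer` are hypothetical, and normalizing the decoder row before scaling is an assumption made here so that the coefficient directly sets the vector's norm.

```python
import numpy as np

def decode_steering_vector(W_dec, feature_idx, coefficient):
    """Scale the (unit-normalized) decoder direction of one SAE feature."""
    direction = W_dec[feature_idx]
    return coefficient * direction / np.linalg.norm(direction)

def steer(hidden, steering_vector):
    """Add the steering vector to every token position's hidden state."""
    return hidden + steering_vector

rng = np.random.default_rng(0)
W_dec = rng.normal(size=(8, 16))    # toy decoder: 8 features, model width 16
hidden = rng.normal(size=(5, 16))   # toy hidden states for 5 token positions

vec = decode_steering_vector(W_dec, feature_idx=2, coefficient=4.0)
steered = steer(hidden, vec)
```

Varying `coefficient` here is the simplest way to see its role: the direction stays fixed while the strength of the intervention changes.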