Spectral Editing of Activations
Proposes a method that modifies an LLM's internal activations to enhance factuality and reduce bias:
- Preserves directions with high covariance with positive attributes
- Removes directions with high covariance with negative attributes
These directions are computed with an SVD-based spectral decomposition. The edit is applied at inference time by projecting and then restoring activations at the last few Transformer layers.
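
A minimal NumPy sketch of the idea described above, not the authors' implementation. It assumes paired activation matrices of shape (n, d) collected from neutral, positive, and negative prompts. It also assumes the directions come from the SVD of the cross-covariance between neutral and positive/negative activations. The names `editing_projections`, `edit_activation`, and `keep_ratio` are hypothetical. Averaging the two projections and rescaling to the original norm is only one possible reading of "restoring".

```python
import numpy as np

def editing_projections(H, H_pos, H_neg, keep_ratio=0.9):
    """Build two (d, d) projection matrices from (n, d) activation matrices.

    Assumption: H holds neutral activations, H_pos / H_neg hold activations
    for the same n prompts paired with positive / negative demonstrations.
    """
    n, d = H.shape
    k = min(d, max(1, int(round(keep_ratio * d))))  # number of directions kept

    # Center each set so the products below are cross-covariances.
    Hc = H - H.mean(axis=0)
    Pc = H_pos - H_pos.mean(axis=0)
    Nc = H_neg - H_neg.mean(axis=0)
    cov_pos = Hc.T @ Pc / n  # (d, d)
    cov_neg = Hc.T @ Nc / n  # (d, d)

    # Left singular vectors, ordered by decreasing singular value.
    U_pos, _, _ = np.linalg.svd(cov_pos)
    U_neg, _, _ = np.linalg.svd(cov_neg)

    # Preserve the k directions that covary most with the positive set.
    U_keep = U_pos[:, :k]
    P_pos = U_keep @ U_keep.T

    # Remove the d - k directions that covary most with the negative set,
    # i.e. keep only the k directions with the smallest singular values.
    U_rest = U_neg[:, d - k:]
    P_neg = U_rest @ U_rest.T
    return P_pos, P_neg

def edit_activation(h, P_pos, P_neg):
    """Project a (d,) or (n, d) activation, then restore its original norm."""
    edited = 0.5 * (h @ P_pos + h @ P_neg)  # assumed way of combining the two
    old = np.linalg.norm(h, axis=-1, keepdims=True)
    new = np.linalg.norm(edited, axis=-1, keepdims=True)
    return edited * old / (new + 1e-12)
```

In a real model, `edit_activation` would be called on the hidden states of only the last few Transformer layers during the forward pass, for instance through a forward hook, with one pair of projections per edited layer.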