DSPA: Dynamic SAE Steering for Data-Efficient Preference Alignment
Sparse autoencoders (SAEs) have emerged as a dominant paradigm for mechanistic interpretability, allowing for increased visibility into the semantic content of LLM hidden states. Recent work has...
https://openreview.net/forum?id=1ARWFG6IwJ

Seonglae Cho