- Sparse autoencoders (SAEs) offer many interpretability capabilities, including steering vectors. However, the usual approaches, such as relying on external LLMs, internal gradients, or changes in the logit distribution, do not take external use cases into account. Here I propose extracting task-specific features using a text classification dataset, and find that these features actually help steer the model in that direction. I also propose a token-position-aware way of setting the steering coefficient, and compare the performance degradation of the steering vector method with that of native SAE decoder steering. Finally, I compare performance degradation across the token positions at which steering is applied and across the number of tokens, counted from the end, to which it is applied.
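The idea above can be sketched in a few lines of NumPy. This is only an illustrative sketch under assumptions that the note does not state: features are ranked by Pearson correlation between pooled SAE activations and binary class labels, and steering adds a constant multiple of the selected SAE decoder rows to the last `n_last` token positions. The names `select_features`, `steer_last_tokens`, `coeff`, and `n_last` are hypothetical and chosen for this example.

```python
import numpy as np


def select_features(acts, labels, top_k=1):
    """Rank SAE features by Pearson correlation with class labels (assumed criterion).

    acts:   (n_samples, n_features) pooled SAE activations per text
    labels: (n_samples,) binary labels from a text classification dataset
    """
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels, dtype=float)
    a = acts - acts.mean(axis=0)
    y = labels - labels.mean()
    denom = np.sqrt((a ** 2).sum(axis=0) * (y ** 2).sum())
    # Features with zero variance get correlation 0 instead of NaN.
    corr = np.divide(a.T @ y, denom, out=np.zeros(acts.shape[1]), where=denom > 0)
    top = np.argsort(-corr)[:top_k]
    return top, corr


def steer_last_tokens(hidden, decoder, feature_ids, coeff, n_last):
    """Add coeff * (sum of selected decoder rows) to the last n_last positions.

    hidden:  (seq_len, d_model) residual-stream activations
    decoder: (n_features, d_model) SAE decoder matrix
    """
    steered = np.array(hidden, dtype=float, copy=True)
    if n_last <= 0:
        return steered
    direction = decoder[feature_ids].sum(axis=0)
    steered[-n_last:] += coeff * direction
    return steered
```

Varying `n_last` and `coeff` in a setup like this is one way to probe how much task performance degrades as steering covers more token positions; a position-dependent coefficient could replace the constant `coeff`, but its exact form is not specified in the note.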
CorrSteer Abstract
Creator: Seonglae Cho
Created: 2025 Feb 18 14:42
Editor: Seonglae Cho
Edited: 2025 Feb 18 14:42