CorrSteer Abstract

Sparse autoencoders (SAEs) offer substantial interpretability capabilities, including the construction of steering vectors. However, existing approaches, which rely on external LLMs, internal gradients, or changes in the logit distribution, do not take external use cases into account. I propose extracting specific features with a text classification dataset and show that these features help steer the model in the corresponding direction. I also propose a way to set the steering coefficient that is aware of token position, and I compare the performance degradation of the steering-vector method with that of native SAE-decoder steering. Finally, I compare performance degradation across the token positions at which steering is applied and across the number of tokens, counted from the end of the sequence, to which it is applied.
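The abstract does not say how features are selected or how the coefficient is applied, so the following is only a minimal sketch of one possible reading. It assumes that features are ranked by the Pearson correlation between pooled SAE activations and binary classification labels (suggested by the name CorrSteer, but not confirmed here), and that steering adds a scaled SAE decoder direction to the hidden states of the last few tokens. The function names, array shapes, and the parameters `top_k`, `coeff`, and `n_last` are all hypothetical.

```python
import numpy as np

def select_features_by_correlation(acts, labels, top_k=1):
    """Rank SAE features by Pearson correlation with the labels (assumed criterion).

    acts:   (n_examples, n_features) pooled SAE activations per example
    labels: (n_examples,) binary class labels
    Returns the indices of the top_k most positively correlated features
    and the full correlation vector.
    """
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels, dtype=float)
    a = acts - acts.mean(axis=0)      # center each feature
    y = labels - labels.mean()        # center the labels
    denom = np.sqrt((a ** 2).sum(axis=0) * (y ** 2).sum())
    # Features with zero variance get correlation 0 instead of NaN.
    corr = np.divide(a.T @ y, denom, out=np.zeros(acts.shape[1]), where=denom > 0)
    top = np.argsort(-corr)[:top_k]
    return top, corr

def steer_last_tokens(hidden, decoder, feature_ids, coeff, n_last):
    """Add coeff * (sum of selected decoder rows) to the last n_last token positions.

    hidden:  (seq_len, d_model) hidden states at the steered layer
    decoder: (n_features, d_model) SAE decoder matrix (hypothetical layout)
    """
    steered = np.array(hidden, dtype=float, copy=True)
    direction = decoder[feature_ids].sum(axis=0)
    start = max(steered.shape[0] - n_last, 0)   # only the final n_last positions
    steered[start:] += coeff * direction
    return steered
```

Varying `n_last` and `coeff` in a sketch like this is one way to set up the comparison between token position, number of steered tokens, and performance degradation that the abstract describes.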