Generation-time SAE feature steering across all layers, optimized per task to maximize accuracy with minimum side effects; correlation-extracted features reveal semantic task relevance
CorrSteer pushes the boundary of static steering to improve task-specific performance by literally using every single layer of the transformer. It pushes static steering to its limit.
AFAIK (please correct me if I am wrong), I am the first person to successfully steer every single layer of an LLM at the same time while improving performance.
One more thing: the steering is applied per layer.
Problem. Adjusting a large language model toward a specific behavior, such as better accuracy, refusing harmful requests, or reducing bias, is usually done by fine-tuning, which acts like editing a genome with a shotgun: it hits the target but quietly damages unrelated abilities. Lighter alternatives that adjust the model's internal signals exist, but they need paired good/bad examples or huge memory, and they pick what to amplify from the input prompt rather than the actual output behavior.
Solution. We introduce CorrSteer, an automated method that watches which interpretable building blocks light up while the model produces correct answers, then amplifies those blocks across every layer in real time as it writes. We pick blocks by correlating their activity with task success, then verify the link by amplifying each block and checking whether performance improves. We also introduce the Side Effect Ratio, a simple measure of how many unrelated answers change per improved answer.
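The selection step described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's implementation: the array layout, the function names `select_features` and `steer`, and the `coefficient` parameter are all assumptions made for this example. It assumes SAE activations have already been collected at generation time and pooled into one value per sample, layer, and feature.

```python
import numpy as np


def select_features(acts, correct):
    """Pick, per layer, the SAE feature most correlated with task success.

    acts:    (n_samples, n_layers, n_features) pooled SAE activations
             recorded while the model generated its answers (assumed layout).
    correct: (n_samples,) array of 0/1 task outcomes.
    Returns (best_index_per_layer, correlation_of_best_feature).
    """
    y = np.asarray(correct, dtype=float)
    y_c = y - y.mean()                                   # centered outcomes
    a_c = acts - acts.mean(axis=0, keepdims=True)        # centered activations

    # Pearson correlation between every (layer, feature) and the outcome.
    cov = np.einsum("s,slf->lf", y_c, a_c)
    denom = np.sqrt((a_c ** 2).sum(axis=0)) * np.sqrt((y_c ** 2).sum())
    corr = np.divide(cov, denom, out=np.zeros_like(cov), where=denom > 0)

    best = corr.argmax(axis=1)                           # one feature per layer
    return best, corr[np.arange(corr.shape[0]), best]


def steer(hidden, decoder_direction, coefficient):
    """Static additive steering: add a scaled SAE decoder direction.

    `coefficient` is a hypothetical knob; setting it to 0 turns steering off.
    """
    return hidden + coefficient * decoder_direction
```

In this sketch the correlation only proposes candidates. The verification step from the text (amplify each selected feature and keep it only if task performance actually improves) would wrap around `steer` and is not shown here.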
Impact. CorrSteer matches fine-tuning accuracy on knowledge tasks while halving the side-effect cost, works on any task that can be scored correct or incorrect, and the chosen blocks are mostly semantic concepts that humans can read, offering a continuous safety dial that can be tuned or turned off without retraining.
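To make the side-effect cost concrete, here is one possible reading of the Side Effect Ratio as described above: answers that changed without becoming an improvement, divided by answers that improved. The exact formula in the paper may differ; the function name `side_effect_ratio` and its inputs are hypothetical and chosen only for this sketch.

```python
import numpy as np


def side_effect_ratio(base_answers, steered_answers, gold):
    """Changed-but-not-improved answers per improved answer (assumed reading)."""
    base = np.asarray(base_answers)
    steered = np.asarray(steered_answers)
    gold = np.asarray(gold)

    changed = base != steered                  # steering altered the answer
    improved = changed & (steered == gold)     # wrong before, correct after
    side_effects = changed & ~improved         # every other change

    n_improved = int(improved.sum())
    if n_improved == 0:
        # No improvement at all: infinite cost if anything changed, else zero.
        return float("inf") if side_effects.any() else 0.0
    return float(side_effects.sum() / n_improved)
```

A lower value means fewer collateral changes per gained answer, so under this reading a steering method and a fine-tuned model can be compared on the same evaluation set.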
But dynamic steering has still not resolved this. People think dynamic steering will easily win on performance. However, although I did not mention it in the paper, simply adding a gating mechanism to CorrSteer worsens performance, which implicitly indicates that linear steering itself has a clear instability and that dynamic linear steering is potentially not feasible.
Key takeaways
- Generation-time features better reflect an LLM’s capabilities.
- We can enhance LLMs by filling sparse activations with one feature per layer.
- Is this just like ADHD, where activation is overall high and non-sparse?
Our next paper explores dynamic steering → please follow me on X and LinkedIn!
Seonglae Cho