CorrSteer Paper plan

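The CorrSteer generation-time test also improved performance with other methods, which suggests the choice genuinely helps

Rather than framing this as a method that needs no intermediate learnable network such as JumpReLU, introducing one might also be fine; if nothing gets selected at all: convergence

With do_sample enabled during token decoding, a single sample yields several different outputs, so test-time scaling is also possible, since the features change on every draw (as in GRPO)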

masking for SFT

Use the observed sparsity to show that the required sample size increases, and prove that the sweet spot is the average positive coefficient

s - sparsity

N - required sample size

Lean Programming Language

Merge the baselines

Baseline graph with MI and Fisher? Do not call it the baseline; label it explicitly as non-steered

after fine tuning we don’t need baseline complementary

add some explanation for subcircuit between corrsteer-a and corrsteer-1

Multi-layer Superiority: feature collaboration - remove second sentence, with a better starting sentence

progress for SAE feature ablation

Remove the percent signs

causality

fine tuning other paper - we tried based on our knowledge

we don’t need the SAE at running time

mention zero shot

Observations: textbf and remove:

corrsteer-1 -All -Prune naming

gemma-2 2b

clarify that SAEs are not required during inference, since the steering is static
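A minimal sketch of why static steering needs no SAE at inference, assuming the steering vector is a selected feature's decoder direction scaled by a fixed coefficient. The function names, the 4-dimensional vectors, and the values are hypothetical and only illustrate the idea; they are not the project's actual code.

```python
from typing import List


def build_steering_vector(decoder_row: List[float], coeff: float) -> List[float]:
    """Scale one decoder direction by its coefficient (done once, offline)."""
    return [coeff * w for w in decoder_row]


def steer(hidden: List[float], steering_vector: List[float]) -> List[float]:
    """Add the precomputed vector to a residual-stream state; no SAE call here."""
    return [h + s for h, s in zip(hidden, steering_vector)]


# Hypothetical decoder direction of one selected feature (assumed values).
decoder_row = [0.5, 0.0, -0.5, 1.0]
vec = build_steering_vector(decoder_row, coeff=2.0)

# At inference only the stored vector is used.
steered = steer([1.0, 1.0, 1.0, 1.0], vec)
```

Only `vec` has to be stored per steered layer, so the SAE weights can be dropped once the features and coefficients are fixed.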

O(1) → O(n)

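Sample-size issue: point to the progress figure. The method worked well even with as few as about 100 samples, and showed no meaningful gain beyond 4,000, so it is an easy method to apply. With small sample sets such as GSM8K (run with 1,000) or HarmBench (108), variance was high, especially for CorrSteer-A, which suggests the samples were sometimes insufficient for computing correlations; roughly 4,000 or more is recommended

Why it works: show that linear is better; it was better everywhere except BBQ ambig.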

Title

CorrSteer: Generation-Time LLM Steering via Correlated Sparse Autoencoder Features

CorrSteer: Inference Time Steering Based on SAE Feature Correlation Improves Performance and Safety

Needs experiments or effort

Check the change on XSTest

Organize the LLaMA features in the appendix

Incorporate the mech interp review

Run 3 or 5 experiments for statistics?

prompt engineering

Capability Mitigation - reverse rewarding

feature ablation only feature that provided on the paper

knock in knock out

feature interaction could be explained by comparing the features found by corrsteer-1 and corrsteer-a: they differ because the features collaborate

Not possible: the current approach is static, not training with gather

system diagram

structeval?

mmlu domain cross ability

cross domain

reduce the empty space

A simple method, but hard to think of, and an important milestone showing that mechanistic interpretability research should move forward for more than interpretability alone

Feature coefficients only need the positive part, and

feature selection needs multi-sample correlation; masking to positive activations matters little

you can take an SAE logit from 5 to 3 or from 5 to 1, but you can’t take it from 1 to -1, since that gives unpredictable results.
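One possible reading of the positive-only coefficient, as a hedged sketch: take the mean of a feature's strictly positive activations and ignore the zeros. The function name and the sample values are hypothetical, and the notes do not confirm that this is the exact formula used.

```python
from typing import Sequence


def positive_mean_coefficient(activations: Sequence[float]) -> float:
    """Mean of strictly positive activations; 0.0 if the feature never fires."""
    positives = [a for a in activations if a > 0.0]
    return sum(positives) / len(positives) if positives else 0.0


# Assumed activations of one feature over four samples.
coeff = positive_mean_coefficient([0.0, 2.0, 0.0, 4.0])
```

Averaging only over samples where the feature fires keeps the coefficient at a magnitude the SAE feature actually takes when active, which fits the note above that pushing a logit below zero is unpredictable.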

Title & Abstract

better than MI based extraction or fisher matrix based

What do you call a paper that opens up a new field, so to speak?

Frame it as original research for general steering; what happens on HarmBench when the reward is reversed?

ICLR

We do not provide head-to-head comparison with CAA (Rimsky et al., 2024), FGAA (Soo et al., 2025), or PaCE (Luo et al., 2024), since these methods require contrastive datasets and steer context-hidden-state activations rather than generation-time features. CorrSteer instead operates exclusively at inference time, making it complementary rather than directly competing with these approaches

Or compare against baselines such as CAA and mutual information

One sentence emphasizing that this is the first case of steering with generation-time activations across multiple tokens and multiple layers

To the best of our knowledge, CorrSteer represents the first approach to leverage generation-time activations for multi-token, multi-layer SAE-based steering, and our experiments are uniquely enabled by Gemma Scope and LLaMA Scope, the only open releases providing SAEs across all residual stream layers, thereby ensuring both methodological novelty and evaluation diversity

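Gemma Scope and LLaMA Scope are the only models with SAEs released for every layer, so we used both to secure experimental diversity.

Use IDTA to extract sparse features in real time, and show higher performance than fine-tuning even with little real-time compute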

Existing steering approaches rely on contrastive examples, which are limited to static token contexts. In contrast, CorrSteer directly leverages generation-time activations, extending SAE steering beyond contrastive or context-only settings and achieving practical improvements across QA, safety, and bias benchmarks.

Existing steering approaches rely on contrastive examples restricted to static contexts. In contrast, CorrSteer goes beyond by directly leveraging generation-time activations, extending SAE-based steering and achieving practical gains across QA, safety, and bias benchmarks.

Make the title more detailed and put more results in the abstract - interpretability or side effects, via cross-/interdisciplinarity

Trim the abstract slightly, by one or two lines

Method

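The key point is that steering is injected at the generation-time tokens. The steering position is also justified because it matches where features are extracted (context and generation time), so the steering is more faithful than in prior work

All three methods have pros and cons depending on SER, compute, and purpose; details are covered in the Results section.

O(1): more precisely O(LD) for fixed L and D. Memory complexity does not depend on sample size, so it scales and needs no activation store. There is also no SAE dependency at inference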
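The memory claim above can be sketched with a streaming Pearson correlation that keeps only running sums per feature, so memory is O(D) per layer (O(LD) over L layers) regardless of how many samples are seen. This is a minimal illustration with a hypothetical class name and toy data; it assumes one pooled activation vector and one scalar correctness label per sample.

```python
import math
from typing import List, Sequence


class StreamingCorrelation:
    """Running Pearson correlation between D features and a scalar outcome."""

    def __init__(self, dim: int) -> None:
        self.n = 0
        self.sum_x = [0.0] * dim
        self.sum_xx = [0.0] * dim
        self.sum_xy = [0.0] * dim
        self.sum_y = 0.0
        self.sum_yy = 0.0

    def update(self, x: Sequence[float], y: float) -> None:
        """Fold one sample into the running sums; the sample is not stored."""
        self.n += 1
        self.sum_y += y
        self.sum_yy += y * y
        for i, xi in enumerate(x):
            self.sum_x[i] += xi
            self.sum_xx[i] += xi * xi
            self.sum_xy[i] += xi * y

    def correlations(self) -> List[float]:
        """Pearson r per feature; 0.0 where either variance is zero."""
        n = self.n
        var_y = n * self.sum_yy - self.sum_y ** 2
        out = []
        for sx, sxx, sxy in zip(self.sum_x, self.sum_xx, self.sum_xy):
            var_x = n * sxx - sx ** 2
            if var_x > 0 and var_y > 0:
                out.append((n * sxy - sx * self.sum_y) / math.sqrt(var_x * var_y))
            else:
                out.append(0.0)
        return out


# Toy data: feature 0 fires on correct samples, feature 1 never fires.
acc = StreamingCorrelation(dim=2)
for x, y in [([1.0, 0.0], 1.0), ([0.0, 0.0], 0.0), ([2.0, 0.0], 1.0), ([0.0, 0.0], 0.0)]:
    acc.update(x, y)
corr = acc.correlations()
```

Because only five running sums per feature are kept, no activation store is required and the sample count can grow without changing the memory footprint.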

Coefficients are likewise derived from actual data, and the pipeline is fully automated, so no hyperparameter tuning is needed, which is another advantage

Experiments

Compare SER and accuracy against PEFT and SFT

If LLaMA does not look like it will work, just drop it; the model is too unstable and there is no time

cross-benchmark transfer (mmlu/mmlupro, bbq/emgsd)

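Show that side effects increase with all-token pooled activations and with negative features

Negative correlation does not make sense in the first place: the SAE only produces positive activations, and although a negative activation might be effective when the feature is already active, when it is not active we cannot know what side effects the negative direction would have, so we did not use it. We also tried subtracting with negative correlation in the CorrSteer-E method, but it lowered performance, although it occasionally worked on a single layer

Build a Notion table and generate directly from it → regenerate the 2 graphs, dropping what should be dropped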

Also split the merged 0 - 200 range, and replace the graph with the version done with 40 if it looks nicer

safe/unsafe tendency, task-wise analysis and feature analysis

Photo taken at the office → CorrSteer diagram (the key idea is the brain 10% analogy) → x.com

task frequency / original density = task specificity or task generalizability
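The ratio above as a small sketch. The function name and the example numbers are hypothetical; the reading that a ratio above 1 means the feature fires more often on the task than in general follows directly from the formula.

```python
def task_specificity(task_frequency: float, original_density: float) -> float:
    """Ratio of a feature's firing rate on the task to its base-rate density."""
    if original_density <= 0.0:
        raise ValueError("original_density must be positive")
    return task_frequency / original_density


# Assumed values: fires on 50% of task samples, 25% density in general text.
ratio = task_specificity(task_frequency=0.5, original_density=0.25)
```

A high ratio suggests a task-specific feature, while a ratio near 1 suggests a feature that is equally common everywhere.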

No discussion of broader implications for AI safety

Reproduce EMGSD and apply the features extracted from it to BBQ: a natural approach at the concept level, because an SAE is close to a topic model

Run SER on EMGSD properly, then state that it compares favorably against CAA and ActAdd

On reflection, raw activations also seem feasible

Finally, the SER is compared when pooling on every token rather than the inference-time generated token. This also needs to be done

referencing frequency, EMGSD more details in the appendix

Limited discussion of scalability to larger models (2B → 8B)

Was DeepSeekMath really trained only on math?

Make the fine-tuning comparison detailed

Cite the progress image and say it even surpasses constrained decoding

Insufficient evidence for the claim that "generation-time features are more causal"

appendix CorrSteer-P bold, see some features

Add additional feature analysis to the appendix

Results

Recheck the LLaMA coefficient computation, and call it a "task circuit"; treat it as a single layer, because circuits are complex and entangled

LLaMA shows lower SER the more layers are steered, while Gemma shows the opposite trend; this appears to be due to dictionary size, since "single" is less monosemantic

Interestingly, adding the bias when steering made the SER very low. We hypothesize that the bias acts as an additional attention sink \citep{goodfire}, adding norm so that this token attracts more attention, which improves correctness overall.

interpretability

That multi-layer steering gave good results matches what was seen in (ICV, SPARE), but it varies by task

Cut the repeated emphasis on generation-time activations

(constrained decoding, SER definition) hurt the focus of the main text enough that they could be moved to the appendix

Robustness evidence strong enough to support real “industrial deployment” is lacking.

Discussion

Pearson correlation is the unit of human-interpretable features, since it reflects linear patterns

The main problem of AI steering is robustness for industrial use cases. That is why we need AI control and precise steering, whether prompt-based or representation-based.

The same holds for the LLM SAE latent space: inserting a single intervention does not cause much interference. So inserting one feature direction per layer worked well

The saying that humans use 10% of their brain reflects that diverse features are activated sparsely and selectively. Looking at that limit, however, each part can use a little more energy without causing interference.

Reduce sentences starting with "we"

Conclusion

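Without contrastive pairs, CorrSteer makes correctness alone the direct optimization target of steering, which generalizes better

The definition of the SER metric is intuitive, but there is no way to distinguish causal contribution from mere correlation. (Any performance gain is treated as good)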
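For reference while discussing SER, a hedged sketch of one plausible definition: among examples whose correctness changed after steering, the share that flipped from correct to incorrect. These notes do not spell out the formula, so this definition, the function name, and the toy data are assumptions.

```python
from typing import Sequence


def side_effect_ratio(before: Sequence[bool], after: Sequence[bool]) -> float:
    """Share of changed outcomes that flipped from correct to incorrect."""
    if len(before) != len(after):
        raise ValueError("before and after must have the same length")
    harmed = sum(1 for b, a in zip(before, after) if b and not a)
    fixed = sum(1 for b, a in zip(before, after) if not b and a)
    changed = harmed + fixed
    return harmed / changed if changed else 0.0


# Toy per-example correctness before and after steering (assumed data).
before = [True, True, False, False, False]
after = [True, False, True, True, False]
ser = side_effect_ratio(before, after)  # 1 harmed out of 3 changed
```

Under this reading, an SER of 0 means every changed answer was a fix, which still says nothing about whether the steered feature caused the fix.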

Images

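Showing the top 10 features per layer as dotted lines with a palette would look great. Top-k should be connected with short vertical lines, fading toward the bottom, and the global top should be red. Frequency and the number of changed examples can be visualized too, along with influence and coefficients. Split each into its own figure and put the other tasks in the appendix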

Future works

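Comparison with prompt engineering? Its SER is expected to be high

Performance degrades on GSM8K reasoning; there is a cause analysis but no solution yet. Candidate: dynamic

In the future, when filtering features, side effects could be minimized by subtracting the component projected onto already-activated features before steering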
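The projection idea above could look roughly like this: orthogonalize the directions of already-active features with Gram-Schmidt, then subtract the steering vector's projection onto their span. A sketch under the assumption that features are represented by plain direction vectors; all names and the 3-dimensional example are hypothetical.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def remove_active_components(steer_vec: Sequence[float],
                             active_dirs: Sequence[Sequence[float]],
                             eps: float = 1e-12) -> List[float]:
    """Return steer_vec minus its projection onto span(active_dirs)."""
    basis: List[List[float]] = []
    for d in active_dirs:
        # Gram-Schmidt: remove components along the basis built so far.
        r = list(d)
        for b in basis:
            c = dot(r, b)
            r = [ri - c * bi for ri, bi in zip(r, b)]
        norm_sq = dot(r, r)
        if norm_sq > eps:  # skip directions already in the span
            norm = norm_sq ** 0.5
            basis.append([ri / norm for ri in r])
    out = list(steer_vec)
    for b in basis:
        c = dot(out, b)
        out = [oi - c * bi for oi, bi in zip(out, b)]
    return out


# Two active directions span the first two axes, so only the third survives.
residual = remove_active_components([1.0, 1.0, 1.0],
                                    [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
```

Orthogonalizing first matters because the active directions are generally not orthogonal, and subtracting their projections one by one without it would leave a residual inside their span.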

What about just a global top 10? Same layer: "global" is really per-layer, "foreach" is really foreach, and the filter method seems better

TIGER-Lab/StructEval

Missing

No mention of potential negative societal impacts

Could benefit from more concrete recommendations for practitioners

Better

Add statistical significance tests (t-test, bootstrap, etc.)
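A minimal sketch of the bootstrap option: a paired percentile bootstrap over per-example correctness, giving a confidence interval for the accuracy difference between steered and non-steered runs. The function name, defaults, and the 8-example data are hypothetical; real runs would use the full benchmark outputs.

```python
import random
from typing import List, Sequence, Tuple


def paired_bootstrap_ci(base: Sequence[int], steered: Sequence[int],
                        n_resamples: int = 2000, alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float]:
    """Percentile CI for mean(steered - base), resampling examples in pairs."""
    if len(base) != len(steered) or not base:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    diffs = [s - b for b, s in zip(base, steered)]
    n = len(diffs)
    means: List[float] = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


# Assumed per-example correctness (1 = correct) for the same 8 questions.
base = [0, 1, 0, 0, 1, 0, 1, 0]
steered = [1, 1, 1, 0, 1, 1, 1, 0]
low, high = paired_bootstrap_ci(base, steered)
```

If the interval excludes 0, the gain is unlikely to come from resampling noise alone; pairing by example keeps question difficulty from inflating the variance.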

Provide evidence for the SAE non-negative activation claim

Performance comparison with other SAE steering methods

Clarify experimental details

Images

Coefficient

Correlation

Frequency

Examples

Accuracy

Interpretability

Transferability

Method

global steering CorrSteer-G

single layer steer (mention bias) CorrSteer-L

selective steering (validation) CorrSteer-S

SteerVR is a nice name, but it drops "correlation"; just mentioning the link to verifiable rewards should be enough

Exp

HarmBench → BBQ/XSTest Transfer

to xstest is bad


# HarmBench negative features → BBQ disambig
python eval.py manual --feature_file=checkpoints/gemma2b_harmbench_global_features.json --method=global --task=bbq --filter_value=disambig --example

# HarmBench negative features → BBQ ambig  
python eval.py manual --feature_file=checkpoints/gemma2b_harmbench_global_features.json --method=global --task=bbq --filter_value=ambig --example

# HarmBench negative features → XSTest
python eval.py manual --feature_file=checkpoints/gemma2b_harmbench_global_features.json --method=global --task=xstest --example

BBQ → HarmBench/XSTest Transfer


# BBQ disambig negative features → HarmBench
python eval.py manual --feature_file=checkpoints/gemma2b_bbq_global_disambig_features.json --method=global --task=harmbench --example

# BBQ disambig negative features → XSTest
python eval.py manual --feature_file=checkpoints/gemma2b_bbq_global_disambig_features.json --method=global --task=xstest --example

# BBQ ambig negative features → HarmBench
python eval.py manual --feature_file=checkpoints/gemma2b_bbq_global_ambig_features.json --method=global --task=harmbench --example

# BBQ ambig negative features → XSTest
python eval.py manual --feature_file=checkpoints/gemma2b_bbq_global_ambig_features.json --method=global --task=xstest --example

XSTest → HarmBench/BBQ Transfer


# XSTest negative features → HarmBench
python eval.py manual --feature_file=checkpoints/gemma2b_xstest_global_features.json --method=global --neg=True --task=harmbench --example

# XSTest negative features → BBQ disambig
python eval.py manual --feature_file=checkpoints/gemma2b_xstest_global_features.json --method=global --neg=True --task=bbq --filter_value=disambig --example

# XSTest negative features → BBQ ambig
python eval.py manual --feature_file=checkpoints/gemma2b_xstest_global_features.json --method=global --neg=True --task=bbq --filter_value=ambig --example

MMLU → MMLU-Pro/BBQ Transfer


# MMLU features → MMLU-Pro
python eval.py manual --feature_file=checkpoints/gemma2b_mmlu_global_features.json --method=global --task=mmlupro --example

# MMLU features → BBQ disambig
python eval.py manual --feature_file=checkpoints/gemma2b_mmlu_global_features.json --method=global --task=bbq --filter_value=disambig --example

# MMLU features → BBQ ambig
python eval.py manual --feature_file=checkpoints/gemma2b_mmlu_global_features.json --method=global --task=bbq --filter_value=ambig --example

MMLU-Pro → MMLU/BBQ Transfer


# MMLU-Pro features → MMLU         
python eval.py manual --feature_file=checkpoints/gemma2b_mmlupro_global_features.json --method=global --task=mmlu --example 

# MMLU-Pro features → BBQ disambig         
python eval.py manual --feature_file=checkpoints/gemma2b_mmlupro_global_features.json --method=global --task=bbq --filter_value=disambig --example      

# MMLU-Pro features → BBQ ambig         
python eval.py manual --feature_file=checkpoints/gemma2b_mmlupro_global_features.json --method=global --task=bbq --filter_value=ambig --example

BBQ → MMLU/MMLU-Pro Transfer


# BBQ disambig features → MMLU
python eval.py manual --feature_file=checkpoints/gemma2b_bbq_global_disambig_features.json --method=global --task=mmlu

# BBQ disambig features → MMLU-Pro
python eval.py manual --feature_file=checkpoints/gemma2b_bbq_global_disambig_features.json --method=global --task=mmlupro

# BBQ ambig features → MMLU
python eval.py manual --feature_file=checkpoints/gemma2b_bbq_global_ambig_features.json --method=global --task=mmlu

# BBQ ambig features → MMLU-Pro
python eval.py manual --feature_file=checkpoints/gemma2b_bbq_global_ambig_features.json --method=global --task=mmlupro

Negative

49.45% 466/542 = 0.8598

13.72% 194/196 = 0.9898

12.15%


# MMLU negative features
python eval.py manual --feature_file=checkpoints/gemma2b_mmlu_global_features.json --method=global --neg=True --task=mmlu --example

# MMLU-Pro negative features
python eval.py manual --feature_file=checkpoints/gemma2b_mmlupro_global_features.json --method=global --neg=True --task=mmlupro --example

# BBQ disambig negative features
python eval.py manual --feature_file=checkpoints/gemma2b_bbq_global_disambig_features.json --method=global --neg=True --task=bbq --filter_value=disambig --example

# BBQ ambig negative features
python eval.py manual --feature_file=checkpoints/gemma2b_bbq_global_ambig_features.json --method=global --neg=True --task=bbq --filter_value=ambig --example

# HarmBench negative features
python eval.py manual --feature_file=checkpoints/gemma2b_harmbench_global_features.json --method=global --neg=True --task=harmbench --example

# XSTest negative features
python eval.py manual --feature_file=checkpoints/gemma2b_xstest_global_features.json --method=global --neg=True --task=xstest --example

# SimpleQA negative features
python eval.py manual --feature_file=checkpoints/gemma2b_simpleqa_global_features.json --method=global --neg=True --task=simpleqa --example


# MMLU negative features
python eval.py manual --feature_file=checkpoints/gemma2b_mmlu_global_features.json --method=single --neg=True --task=mmlu --example

# MMLU-Pro negative features
python eval.py manual --feature_file=checkpoints/gemma2b_mmlupro_global_features.json --method=single --neg=True --task=mmlupro --example

# BBQ disambig negative features
python eval.py manual --feature_file=checkpoints/gemma2b_bbq_global_disambig_features.json --method=single --neg=True --task=bbq --filter_value=disambig --example

# BBQ ambig negative features
python eval.py manual --feature_file=checkpoints/gemma2b_bbq_global_ambig_features.json --method=single --neg=True --task=bbq --filter_value=ambig --example

# HarmBench negative features
python eval.py manual --feature_file=checkpoints/gemma2b_harmbench_global_features.json --method=single --neg=True --task=harmbench --example

# XSTest negative features
python eval.py manual --feature_file=checkpoints/gemma2b_xstest_global_features.json --method=single --neg=True --task=xstest --example

# SimpleQA negative features
python eval.py manual --feature_file=checkpoints/gemma2b_simpleqa_global_features.json --method=single --neg=True --task=simpleqa --example

Raw


 # MMLU
python train.py train --model=gemma2b --task=mmlu --layer=global --raw

# MMLU-Pro  
python train.py train --model=gemma2b --task=mmlupro --layer=global --raw --select_token

# BBQ disambig
python train.py train --model=gemma2b --task=bbq --layer=global --raw  --filter_value=disambig

# BBQ ambig
python train.py train --model=gemma2b --task=bbq --layer=global --raw --filter_value=ambig

# HarmBench
python train.py train --model=gemma2b --task=harmbench --layer=global --raw

# XSTest
python train.py train --model=gemma2b --task=xstest --layer=global --raw

# SimpleQA
python train.py train --model=gemma2b --task=simpleqa --layer=global --raw

Pooling


# MMLU
python train.py train --model=gemma2b --task=mmlu --layer=global --pool=mean

# MMLU-Pro  
python train.py train --model=gemma2b --task=mmlupro --layer=global --pool=mean --select_token

# BBQ disambig
python train.py train --model=gemma2b --task=bbq --layer=global --pool=mean --filter_value=disambig

# BBQ ambig
python train.py train --model=gemma2b --task=bbq --layer=global --pool=mean --filter_value=ambig

# HarmBench
python train.py train --model=gemma2b --task=harmbench --layer=global --pool=mean

# XSTest
python train.py train --model=gemma2b --task=xstest --layer=global --pool=mean

# SimpleQA
python train.py train --model=gemma2b --task=simpleqa --layer=global --pool=mean

Mask all


# MMLU
python train.py train --model=gemma2b --task=mmlu --layer=global --mask=all

# MMLU-Pro  
python train.py train --model=gemma2b --task=mmlupro --layer=global --mask=all --select_token

# BBQ disambig
python train.py train --model=gemma2b --task=bbq --layer=global --mask=all --filter_value=disambig

# BBQ ambig
python train.py train --model=gemma2b --task=bbq --layer=global --mask=all --filter_value=ambig

# HarmBench
python train.py train --model=gemma2b --task=harmbench --layer=global --mask=all

# XSTest
python train.py train --model=gemma2b --task=xstest --layer=global --mask=all

# SimpleQA
python train.py train --model=gemma2b --task=simpleqa --layer=global --mask=all

Gsm8k


# Pruned: 4 selected features
python eval.py multi_feature_steering --features="[(5,28164,0.0075),(8,17066,0.0181),(11,19183,0.0600),(19,19557,0.0433)]" --task=gsm8k --model=llama8 --example --few=1

# Global: all layers with features
python eval.py multi_feature_steering --features="[(1,0,0.01),(2,0,0.01),(3,0,0.01),(4,0,0.01),(5,28164,0.0075),(6,0,0.01),(7,0,0.01),(8,17066,0.0181),(9,0,0.01),(10,0,0.01),(11,19183,0.0600),(12,0,0.01),(13,0,0.01),(14,0,0.01),(15,0,0.01),(16,0,0.01),(17,0,0.01),(18,0,0.01),(19,19557,0.0433),(20,0,0.01),(21,0,0.01),(22,0,0.01),(23,0,0.01),(24,0,0.01),(25,0,0.01),(26,0,0.01),(27,0,0.01),(28,0,0.01),(29,0,0.01),(30,0,0.01),(31,0,0.01)]" --task=gsm8k --model=llama8 --example  --few=1

# Single: highest correlation feature
python eval.py multi_feature_steering --features="[(5,28164,0.0075)]" --task=gsm8k --model=llama8 --example  --few=1
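The long `--features` strings above are easy to mistype, so a small helper could generate them. This is a sketch only: `build_features_arg` is a hypothetical name, the triple layout (layer, feature index, coefficient) and the `(layer,0,0.01)` filler are read off the commands above, and coefficients are kept as strings so the output matches the commands character for character.

```python
from typing import Dict, Iterable, Tuple


def build_features_arg(selected: Dict[int, Tuple[int, str]],
                       layers: Iterable[int],
                       filler: Tuple[int, str] = (0, "0.01")) -> str:
    """Format (layer,feature_index,coefficient) triples for --features."""
    parts = []
    for layer in layers:
        idx, coeff = selected.get(layer, filler)
        parts.append(f"({layer},{idx},{coeff})")
    return "[" + ",".join(parts) + "]"


# The four selected features from the pruned command above.
selected = {
    5: (28164, "0.0075"),
    8: (17066, "0.0181"),
    11: (19183, "0.0600"),
    19: (19557, "0.0433"),
}

pruned_arg = build_features_arg(selected, sorted(selected))
global_arg = build_features_arg(selected, range(1, 32))  # layers 1..31
single_arg = build_features_arg(selected, [5])
```

The resulting strings can be pasted into `--features="..."` as in the commands above.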

Decode


# MMLU
python train.py train --model=gemma2b --task=mmlu --layer=24 --decode

# MMLU-Pro  
python train.py train --model=gemma2b --task=mmlupro --layer=25 --decode --select_token

# BBQ disambig
python train.py train --model=gemma2b --task=bbq --layer=17 --decode  --filter_value=disambig

# BBQ ambig
python train.py train --model=gemma2b --task=bbq --layer=17 --decode --filter_value=ambig


 # MMLU
python train.py train --model=gemma2b --task=mmlu --layer=global --decode        

# MMLU-Pro  
python train.py train --model=gemma2b --task=mmlupro --layer=global --decode --select_token

# BBQ disambig
python train.py train --model=gemma2b --task=bbq --layer=global --decode  --filter_value=disambig

# BBQ ambig
python train.py train --model=gemma2b --task=bbq --layer=global --decode --filter_value=ambig

gemma-bbq-ambig

gemma-bbq-disambig

gemma-harmbench

gemma-mmlu

gemma-mmlupro

gemma-gsm8k

gemma-simpleqa

gemma-xstest

llama-bbq-ambig

llama-bbq-disambig

llama-harmbench

llama-mmlu

llama-mmlupro

llama-simpleqa

llama-xstest

🚀 New paper drop!

Our method, CorrSteer, boosts performance on both LLaMA-3.1 8B and Gemma-2 2B-IT.

We ran extensive ablations:

Generation-token vs all-token pooling

Raw activation vs SAE activation

Mean vs Max strategies

Multi-layer vs single-layer steering

👉 The key insight: generation-time token correlation drives performance.

Beyond performance, CorrSteer is interpretable AI Control: it uncovers underlying objectives and reveals the required capabilities that drive task performance.

🔒 For example, on HarmBench, in the LLaMA model, safety-related features were extracted in most layers.

⚖️ For the bias benchmark BBQ, unlike expectation, neutrality-focused features turned out to be most helpful. Interestingly, features that looked too directly related appeared with negative correlation, suggesting that activation of meta-cognitive recognition features may hurt task performance.

🧮 Very interestingly, math features were selected as top correlated in almost every task, meaning math is important even in unexpected datasets. This indirectly supports DeepSeekMath, which showed that math-focused corpora can improve performance across diverse tasks.

On BBQ Ambig, CorrSteer changed only 1,532 wrong answers and 0 correct ones, minimizing the Side Effect Ratio (SER) and showing that representation-level steering could be safer than fine-tuning.

Existing steering approaches rely on contrastive examples, which are limited to static token contexts. In contrast, CorrSteer directly leverages generation-time activations, extending SAE steering beyond contrastive or context-only settings and achieving practical improvements across QA, safety, and bias benchmarks.

diagram?

We believe scalable, interpretable SAE steering can improve both performance & safety.

I’m also open to Researcher / Research Fellow positions in London (offline), feel free to reach out 📩

Slack/Discord

Hey everyone! We found that inference-time SAE features strongly correlate with correctness, enabling fully automated steering without manual tuning.

📈 MMLU: +4%

🛡️ HarmBench: +23%

Our extensive ablations (generation-time token vs all-token, raw activation vs SAE activation, mean vs max, multi vs single-layer) show that test-time token correlation is key.

Compared to fine-tuning, CorrSteer achieves lower Side Effect Ratio (SER) and interpretable steering (math, safety, bias), revealing the underlying capabilities that drive task performance.

Paper

Demo

I’m open to Researcher/Fellow roles in London, feel free to connect!

https://arxiv.org/abs/2506.14866#:~:text=To address this gap%2C we,injection attacks%2C and model misbehavior.

Message

Alan Sun

Karen

~~Faithful Team~~

I am truly proud of how effective this simple idea turned out to be: "Correlating on test-time features", joint work with @Zekun Wu! 🚀

We found that inference-time SAE features strongly correlate with benchmark correctness, enabling fully automated steering without manual tuning.

Our method, CorrSteer, achieves consistent gains on both LLaMA-3.1 8B and Gemma-2 2B:

📈 MMLU: +4%

🛡️ HarmBench: +23%

Through extensive ablations (generation-token vs all-token, raw activation vs SAE activation, mean pooling vs max pooling), we show that generation-token correlation is the key driver.

Beyond performance, CorrSteer is interpretable: selected features align with coherent concepts (math, safety, bias). We also introduce a new metric, SER (Side Effect Ratio), to compare fine-tuning against representation-level steering.

Daniel Tan

LessWrong a bit later; there, explain only the core of the method briefly and focus on interpretability

After BlackboxNLP acceptance, promote on Reddit and LessWrong

Machine Learning

Beginners -> /r/mlquestions or /r/learnmachinelearning , AGI -> /r/singularity, career advices -> /r/cscareerquestions, datasets -> r/datasets

https://www.reddit.com/r/MachineLearning/
