CRL Paper Plan

Change Motivation to attribution graph

Also, things like attribution graphs are too complex; contrast with human-interpretable, single control

inference time alignment

Propose a way to track performance improvement; is this similar in spirit to model diffing?

figure font size

Llama GSM8K is not needed, because we only need to show architectural generalization: tested on multiple architectures, beyond a single family

Layer-wise steering results, and resolution

Check whether CRL and CorrSteer discover the same features - do this later

Also include the correlation of the added features

JumpReLU? Verify that the implementation is exact
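For the implementation check, a minimal reference sketch of JumpReLU next to plain ReLU may help. It assumes the standard definition (pass the pre-activation through unchanged when it exceeds a per-feature threshold, otherwise output zero). The function names and the strict `>` at the boundary are assumptions, not taken from the CRL code.

```python
def relu(pre_acts):
    """Plain ReLU, for comparison."""
    return [max(z, 0.0) for z in pre_acts]


def jumprelu(pre_acts, thresholds):
    """JumpReLU: z * H(z - theta), with one threshold theta per feature.

    Assumption: the comparison is strict (z > theta); check which
    convention the SAE being used follows at z == theta.
    """
    if len(pre_acts) != len(thresholds):
        raise ValueError("pre_acts and thresholds must have the same length")
    return [z if z > t else 0.0 for z, t in zip(pre_acts, thresholds)]


if __name__ == "__main__":
    z = [0.5, 1.5, -1.0]
    theta = [1.0, 1.0, 1.0]
    print("relu    :", relu(z))
    print("jumprelu:", jumprelu(z, theta))
```

Unlike ReLU, a positive pre-activation below the threshold is zeroed, so comparing the two outputs on the same inputs is a quick way to confirm which activation is actually in use.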

selective performance llama

In the system diagram, do the CRL token and layer both need a direction?

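Replace images with higher-quality versions

Paper direction: go with single layer, but mention the performance on simple tasks where all layers were feasible (89, etc.)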

Unify the mathematical notation for the SAE across Background and Method

Formulate as a Markov decision process shared across all layers, indexed with ^{\ell}

In Method, one-hot looks clumsy; take argmax / top-k and set those entries to 1
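The argmax / top-k formulation can be sketched as follows. This is only an illustration: `topk_binary_mask` and its arguments are hypothetical names, and tie-breaking toward the lower index is an assumption.

```python
def topk_binary_mask(logits, k=1):
    """Return a 0/1 vector with 1 at the k largest logits.

    With k=1 this is the one-hot of the argmax. Ties are broken in
    favour of the lower index (an assumption of this sketch).
    """
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and len(logits)")
    # Python's sort is stable, so equal logits keep their index order.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    mask = [0.0] * len(logits)
    for i in top:
        mask[i] = 1.0
    return mask


if __name__ == "__main__":
    policy_logits = [0.1, 2.0, -1.0, 1.5]
    print(topk_binary_mask(policy_logits, k=1))
    print(topk_binary_mask(policy_logits, k=2))
```

Writing the action as "1 on the top-k indices, 0 elsewhere" covers both the single-feature case (k=1) and the multi-feature case with one definition.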

Feature diversity decreased as policy layer depth increased, and critic loss was also better with smaller critic layer depth.

Option

Correlation Steering Warmup

--epsilon

Where and in what order to apply epsilon, act, and masking

JumpReLU

Slightly lower
But combined with masking it reached 78; unexpected, unclear why
multi feature or single feature
threshold initialization
whether to apply it in stage 1 as well

--q

No change

loss softmax

Usually lower, though

(--grpo)

More effective sparse selection structures

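Token decay, or simply multiplying by the correlation instead of adding it - performance holds at the 73 obtained with 100, and it removes a hyperparameter, which is good; the same features are still used.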

Do we need to handle the case of negative corr and negative logit?

When updating corr during actual steering, corr must be computed from the selected SAE feature and the added coeff

Gradually reduce steering with an activation decay of 0.99 or 0.95? - performance was maintained, but the token length is short so it is not very meaningful (73.21)
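A small sketch of the decay schedule, assuming the decay is applied multiplicatively once per generated token to the steering coefficient. That placement is an assumption, and `decayed_coefficients` is a hypothetical helper, not part of the existing code.

```python
def decayed_coefficients(base_coeff, decay, num_tokens):
    """Steering coefficient at each token position t: base_coeff * decay**t."""
    if not 0.0 < decay <= 1.0:
        raise ValueError("decay is expected to be in (0, 1]")
    return [base_coeff * decay ** t for t in range(num_tokens)]


if __name__ == "__main__":
    # Compare how much steering strength is left at the last position
    # for the two candidate decay rates, over a short generation.
    for decay in (0.99, 0.95):
        coeffs = decayed_coefficients(1.0, decay, num_tokens=20)
        print(f"decay={decay}: last-position coefficient = {coeffs[-1]:.4f}")
```

Printing the last-position coefficient for the actual generation length shows how much of the steering strength the schedule removes, which is useful for judging whether short outputs can reveal any effect at all.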

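Could mask so that only features active on the current token (among the encoded ones) are selectable, as long as performance holds

Or the opposite: mask out the currently active ones, in order to add new ones
Where the correlation is multiplied in, could also use either 1 or corr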
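The two masking variants above can be sketched on the policy logits. The sketch assumes that "active" means a strictly positive SAE activation on the current token and that masking is done by setting logits to negative infinity before selection; both are assumptions, and `mask_logits` is a hypothetical name.

```python
import math


def mask_logits(logits, activations, keep_active=True):
    """Mask policy logits by the current token's SAE activations.

    keep_active=True  : only currently active features stay selectable.
    keep_active=False : currently active features are masked out, so the
                        policy can only add features that are not active yet.
    """
    if len(logits) != len(activations):
        raise ValueError("logits and activations must have the same length")
    masked = []
    for logit, act in zip(logits, activations):
        is_active = act > 0.0
        masked.append(logit if is_active == keep_active else -math.inf)
    return masked


if __name__ == "__main__":
    logits = [1.0, 2.0, 3.0]
    acts = [0.0, 0.4, 0.0]
    print(mask_logits(logits, acts, keep_active=True))
    print(mask_logits(logits, acts, keep_active=False))
```

A softmax or argmax applied afterwards never picks a masked entry, so the same selection code can be reused for both variants.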

1. Gumbel Softmax + Top-K Selection

Performance dropped

2. Sparse Attention Mechanism

Performance maintained

3. Straight-Through Estimator (STE)

Performance maintained

4. Learnable Sparse Gates

Performance maintained

1. Gumbel Softmax: differentiable discrete sampling

2. Sparse Attention: use the correlation as attention weights

3. STE: discrete selection + continuous gradients
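As a reminder of what the Gumbel Softmax and STE options compute, here is a stdlib-only sketch of the forward pass. It is not the CRL implementation: the function names are hypothetical, and gradients are not modelled because plain Python has no autodiff. In an autodiff framework, the straight-through variant would emit the hard one-hot in the forward pass and use the soft probabilities for the backward pass.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Relaxed categorical sample: softmax((logits + Gumbel noise) / tau)."""
    if tau <= 0.0:
        raise ValueError("tau must be positive")
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    m = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def hard_one_hot(probs):
    """Discrete selection used in the forward pass of the STE variant."""
    idx = max(range(len(probs)), key=probs.__getitem__)
    return [1.0 if i == idx else 0.0 for i in range(len(probs))]


if __name__ == "__main__":
    rng = random.Random(0)
    soft = gumbel_softmax([1.0, 2.0, 0.5], tau=0.5, rng=rng)
    print("soft sample:", soft)
    print("hard sample:", hard_one_hot(soft))
```

Lower `tau` pushes the soft sample toward a one-hot vector, at the cost of higher-variance gradients.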

or simply

Token-wise context-dependent correlation → decreased

Token position linear freedom → same (emphasizing later positions actually lowers it)

Attention-based correlation weighting → same

Learnable mixing parameter → same

Prompts

Target
