CRL Corrsteer Sync

Use IDTA to extract sparse features in real time and show better performance than fine-tuning, even with little real-time compute.

Revised at 1:45

A more effective sparse selection structure

Token decay, or multiplying by the correlation rather than simply adding it: performance holds at the 73 obtained with 100, and removing a hyperparameter is a plus. The same features are still used.

Do we need to handle the case of negative corr with negative logit?
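A minimal sketch of the add-versus-multiply trade-off and the sign question noted above. The notes do not say exactly which quantities are combined, so the per-feature `logit`, the `weight` parameter, and both function names are assumptions made for illustration only.

```python
def additive_score(logit, corr, weight):
    # Additive form: needs a mixing weight, i.e. one more hyperparameter.
    return logit + weight * corr


def multiplicative_score(logit, corr):
    # Multiplicative form: corr rescales the logit, so no mixing weight is needed.
    return logit * corr


# Hypothetical values, chosen only to show the sign behaviour.
both_negative = multiplicative_score(-2.0, -0.5)  # two negatives give a positive product
mixed_signs = multiplicative_score(2.0, -0.5)     # opposite signs give a negative product
```

Under this assumed form, a negative corr paired with a negative logit produces a positive score, the same as two positive inputs, which may be one reason the case needs explicit handling.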

When updating corr during actual steering, corr must be computed from the selected SAE features together with the coefficients that were added.
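One possible way to keep corr up to date during steering is a streaming Pearson correlation, sketched below. It assumes the tracked quantity is the selected feature's SAE activation plus the added steering coefficient, and that each sample comes with a scalar outcome (for example, correctness). The class name, `effective_activation`, and the sample values are all hypothetical.

```python
import math


class RunningCorrelation:
    """Streaming Pearson correlation between x and y (Welford-style updates)."""

    def __init__(self):
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.m2_x = 0.0  # running sum of squared deviations of x
        self.m2_y = 0.0  # running sum of squared deviations of y
        self.cov = 0.0   # running sum of co-deviations

    def update(self, x, y):
        self.n += 1
        dx = x - self.mean_x  # deviation from the old mean of x
        dy = y - self.mean_y  # deviation from the old mean of y
        self.mean_x += dx / self.n
        self.mean_y += dy / self.n
        # Each update multiplies an old-mean deviation by a new-mean deviation.
        self.m2_x += dx * (x - self.mean_x)
        self.m2_y += dy * (y - self.mean_y)
        self.cov += dx * (y - self.mean_y)

    def value(self):
        denom = math.sqrt(self.m2_x * self.m2_y)
        return self.cov / denom if denom > 0 else 0.0


def effective_activation(sae_activation, added_coeff):
    # Assumption: the steered activation is the raw SAE activation plus the coefficient.
    return sae_activation + added_coeff


tracker = RunningCorrelation()
# Hypothetical (sae_activation, added_coeff, outcome) samples.
for act, coeff, outcome in [(0.2, 1.0, 0.0), (0.5, 1.0, 1.0), (0.1, 0.5, 0.0)]:
    tracker.update(effective_activation(act, coeff), outcome)
```

Tracking the steered value rather than the raw activation keeps the statistic consistent with what the model actually saw during generation.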

Should steering be reduced gradually with an activation decay of 0.99 or 0.95? Performance was maintained (73.21), but the token length is short, so this is not very meaningful.
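A rough sketch of what a per-token decay of the steering coefficient could look like, assuming a simple geometric schedule. The function name, the base coefficient of 1.0, and the 20-token length are illustrative assumptions, not values from the experiments.

```python
def decayed_steering_scales(base_coeff, decay, num_tokens):
    """Return the steering coefficient applied at each generated token."""
    return [base_coeff * decay ** t for t in range(num_tokens)]


# Hypothetical short generation of 20 tokens.
scales_099 = decayed_steering_scales(1.0, 0.99, 20)
scales_095 = decayed_steering_scales(1.0, 0.95, 20)
```

With a decay of 0.99, more than 80% of the coefficient is still left after 20 tokens, which fits the observation that decay matters little on short outputs.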

1. Gumbel Softmax + Top-K Selection

Performance dropped

2. Sparse Attention Mechanism

Performance maintained

3. Straight-Through Estimator (STE)

Performance maintained

4. Learnable Sparse Gates

Performance maintained

1. Gumbel Softmax: differentiable discrete sampling

2. Sparse Attention: use the correlation as the attention weight

3. STE: discrete selection + continuous gradients
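The selection ideas listed above can be sketched in plain Python as forward passes only. This is an illustration under assumptions, not the experiment code: the function names, the temperature, and the sample correlations are hypothetical. The STE variant would use the hard top-k mask in the forward pass and the gradient of the soft weights in the backward pass. That needs an autodiff framework and is not shown here.

```python
import math
import random


def softmax(xs, temperature=1.0):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax(logits, temperature, rng):
    # Perturb each logit with Gumbel(0, 1) noise, then apply a tempered softmax.
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)  # guard against log(0)
        noisy.append(logit - math.log(-math.log(u)))
    return softmax(noisy, temperature)


def top_k_mask(weights, k):
    # Hard 0/1 mask that keeps the k largest weights.
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    keep = set(order[:k])
    return [1.0 if i in keep else 0.0 for i in range(len(weights))]


def sparse_attention(correlations, k):
    # Softmax over correlations, restricted to the top-k features.
    mask = top_k_mask(correlations, k)
    masked = [c if m else -math.inf for c, m in zip(correlations, mask)]
    return softmax(masked)


rng = random.Random(0)
corrs = [0.1, 0.5, 0.9, 0.2]  # hypothetical per-feature correlations
soft_weights = gumbel_softmax(corrs, temperature=0.5, rng=rng)
hard_mask = top_k_mask(soft_weights, k=2)
attention = sparse_attention(corrs, k=2)
```

Because of the Gumbel noise, `hard_mask` can differ between seeds, while `sparse_attention` always puts its weight on the two highest-correlation features.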