CorrSteer Paper

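Interestingly, the negative and positive coefficients matched the intuition generally held about each task. This suggests that, when steering at the feature level, simple training leads the SAE to account for negative activations as well as positive ones, or for an effect due to the removal of a specific direction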

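To guard against spurious correlation, much as one guards against overfitting, and because the causality may run in the opposite direction: in the global method we take the best one, and in foreach mode only the features that improve performance over the existing validation set result pass the feature filter and are passed on to the final stage.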
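The two filtering modes above can be sketched as follows. This is only a minimal illustration under assumptions: `score` is a hypothetical callable returning validation performance when steering with a single feature, `baseline` is the unsteered validation performance, and "the best one" is read as the single top-scoring feature. None of these names come from the notes.

```python
from typing import Callable, List


def select_global(candidates: List[int], score: Callable[[int], float]) -> List[int]:
    """Global mode (assumed reading): keep only the best-scoring feature."""
    best = max(candidates, key=score)
    return [best]


def select_foreach(
    candidates: List[int], score: Callable[[int], float], baseline: float
) -> List[int]:
    """Foreach mode: keep every feature that beats the validation baseline."""
    return [f for f in candidates if score(f) > baseline]


# Hypothetical validation accuracy obtained when steering with each feature id.
toy_scores = {3: 0.62, 7: 0.58, 11: 0.71}
baseline = 0.60  # hypothetical unsteered validation accuracy

global_pick = select_global(list(toy_scores), toy_scores.get)
foreach_pick = select_foreach(list(toy_scores), toy_scores.get, baseline)
```

With these toy numbers, the global mode keeps feature 11, while the foreach mode keeps features 3 and 11 and drops feature 7, which correlates with the task but does not actually help on validation.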

If this does not work on a context-dependent task such as xs, it means that more diverse features have to be selected depending on the context. In other words, where context matters, dynamic feature selection is the key.

Conversely, if the tasks that improve a lot under dynamic selection turn out to be context dependent, that would amount to proving this.

The difference between classification and the benchmark is that only generation is reflected, and that may pick up more spurious correlation (to try: MMLU all).

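The reason only SimpleQA does poorly: these SAE features find directions optimized for tasks that strengthen inner knowledge or attitude, rather than actually bringing in external knowledge. SimpleQA literally measures fidelity to knowledge (if I recall correctly), so there was almost no effect.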

only using test-time features

Visualization

expected value

Future work

Knowledge System (Wikipedia) is All interpretability needs

Do dictionary learning with GPT-2 and wiki IDs, with shuffling; either fix only the features or add a separate loss, and check whether the features are still maintained in training after warmup. Interpretability is important. Correlation matters because humans are only sensitive to linear relations, interpretable at

Since an SAE is effectively a matrix inverse, what if we apply an inverse-matrix loss?
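One possible form of such a loss, sketched under assumptions: ignore the bias and the nonlinearity, and penalize the squared Frobenius distance between `W_dec @ W_enc` and the identity. The function name `inverse_loss` and the shapes (`w_enc` is d_sae x d_model, `w_dec` is d_model x d_sae) are illustrative choices, not something the notes specify.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a: Matrix) -> Matrix:
    """Transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def inverse_loss(w_enc: Matrix, w_dec: Matrix) -> float:
    """Squared Frobenius norm of (W_dec @ W_enc - I), linear part only."""
    prod = matmul(w_dec, w_enc)  # d_model x d_model
    n = len(prod)
    return sum(
        (prod[i][j] - (1.0 if i == j else 0.0)) ** 2
        for i in range(n)
        for j in range(n)
    )


# Toy encoder with d_sae = 3 and d_model = 2.
w_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tied_loss = inverse_loss(w_enc, transpose(w_enc))  # decoder tied to encoder transpose
```

In this toy case the tied decoder gives `W_dec @ W_enc = [[2, 1], [1, 2]]`, so the loss is 4 rather than 0: a transpose-tied decoder is not automatically an inverse of the encoder, and a term like this would push it toward one.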

A tied SAE does not seem to be the same as a sym SAE (transpose).