CorrSteer Research Note 2

Besides looking at a few individual features, also plot the distribution over all features combined.

Scatter plot with per-feature mean and variance on the horizontal and vertical axes.

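Try mean pooling instead of max pooling? Hmm, since this is correlation anyway it probably won't matter, and judging from the incorrect cases there is no obvious alternative either.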

My ipynb: activation visualization (viz) scripts for each layer.

Return the median and the activation density (fraction of nonzero activations) as stats.

Check the fraction of features whose mean is larger than their median.
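The statistics listed above (mean, std, median, activation density, and the share of features whose mean exceeds the median) could be collected in one pass. This is a minimal sketch, assuming the activations are already gathered into an array of shape `(n_samples, n_features)`; the function name `feature_stats` and the synthetic data are hypothetical and not taken from the actual notebook.

```python
import numpy as np

def feature_stats(acts: np.ndarray) -> dict:
    """Per-feature summary statistics.

    acts: (n_samples, n_features) array of SAE activations (assumed layout).
    """
    mean = acts.mean(axis=0)
    std = acts.std(axis=0)
    median = np.median(acts, axis=0)
    # Activation density: fraction of samples where the feature is nonzero.
    density = (acts != 0).mean(axis=0)
    return {
        "mean": mean,
        "std": std,
        "median": median,
        "density": density,
        # Share of features whose mean is strictly larger than their median.
        "frac_mean_gt_median": float((mean > median).mean()),
    }

# Synthetic sparse activations (ReLU of a shifted Gaussian), for illustration only.
rng = np.random.default_rng(0)
acts = np.maximum(rng.normal(-1.0, 1.0, size=(1000, 16)), 0.0)
stats = feature_stats(acts)
print(stats["density"].round(3), stats["frac_mean_gt_median"])
```

For sparse, non-negative activations the median is often exactly zero, so a high `frac_mean_gt_median` mostly reflects sparsity; reading it next to `density` keeps that visible.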

Check whether this differs by token position.

Does the correlation increase as max_token increases? It might, because of max pooling.
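One way to probe the pooling and max_token questions is to pool the same activations both ways, truncate to the first k tokens, and compare the per-feature correlation with the outcome. The sketch below assumes activations of shape `(n_samples, n_tokens, n_features)`, a boolean mask of valid tokens, non-negative activations, and a scalar outcome `y` per sample; `pool`, `pearson_per_feature`, and the toy data are hypothetical names and values for illustration.

```python
import numpy as np

def pool(acts: np.ndarray, mask: np.ndarray, how: str = "max") -> np.ndarray:
    """Pool over the token axis.

    acts: (n_samples, n_tokens, n_features), assumed non-negative.
    mask: (n_samples, n_tokens) boolean, True for valid (non-padding) tokens.
    """
    masked = np.where(mask[:, :, None], acts, 0.0)
    if how == "max":
        # Safe only because activations are assumed non-negative.
        return masked.max(axis=1)
    if how == "mean":
        counts = np.maximum(mask.sum(axis=1, keepdims=True), 1)
        return masked.sum(axis=1) / counts
    raise ValueError(f"unknown pooling: {how}")

def pearson_per_feature(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Pearson correlation between each column of x and y (0 for constant columns)."""
    xc = x - x.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return np.divide(xc.T @ yc, denom, out=np.zeros(x.shape[1]), where=denom > 0)

# Toy data: the outcome depends on the max of feature 0 over tokens.
rng = np.random.default_rng(0)
acts = np.maximum(rng.normal(size=(200, 8, 4)), 0.0)
mask = np.ones((200, 8), dtype=bool)
y = acts[:, :, 0].max(axis=1) + 0.1 * rng.normal(size=200)

for k in (2, 4, 8):  # emulate different max_token budgets by truncation
    r_max = pearson_per_feature(pool(acts[:, :k], mask[:, :k], "max"), y)
    r_mean = pearson_per_feature(pool(acts[:, :k], mask[:, :k], "mean"), y)
    print(k, r_max.round(2), r_mean.round(2))
```

Truncation is only a stand-in for generating with a smaller max_token; with real generations the token content itself changes, so this separates the pooling effect from the generation effect only roughly.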

Show mean and std together.

Notebook to Python script: corrsteer, convert, and dashboard.

The dashboard should have a file option and a folder option.

Compare weights with cosine similarity.

I focused on verifying the idea that "Different SAEs share some common features." I ran two experiments to test this and compared the results with previous research:

Comparing features across SAEs trained on different datasets

Comparing features across SAEs trained on different seeds

Based on the experiment results, I found the following interesting properties of SAE features:

The dataset matters more than seed differences.

Fewer features were shared across SAE models than expected, but shared features still existed under different training setups.

Method

I computed the cosine similarity between SAE feature weights to compare features across different seeds or datasets on the same language model. I was surprised to find that previous research also used cosine similarity, and additionally applied the Hungarian algorithm to match features 1:1, which I did not do. Without the 1:1 constraint, several features can match the same counterpart, so my numbers are somewhat optimistic and less strict than theirs, but the above images still show the trend.
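The difference between the two matching schemes can be sketched as follows. This assumes each SAE's feature directions are available as rows of a matrix with nonzero norm (for example decoder weights, which is an assumption here), and it uses SciPy's `linear_sum_assignment` for the Hungarian-style 1:1 matching. The helper names and the synthetic weights are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cosine_matrix(w_a: np.ndarray, w_b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity; w_a: (n_a, d), w_b: (n_b, d), one row per feature."""
    a = w_a / np.linalg.norm(w_a, axis=1, keepdims=True)
    b = w_b / np.linalg.norm(w_b, axis=1, keepdims=True)
    return a @ b.T

def top1_similarity(sim: np.ndarray) -> np.ndarray:
    """Best match for each feature of A; several A features may pick the same B feature."""
    return sim.max(axis=1)

def hungarian_similarity(sim: np.ndarray) -> np.ndarray:
    """Similarities under a 1:1 assignment that maximizes total similarity."""
    rows, cols = linear_sum_assignment(-sim)  # minimizing -sim maximizes sim
    return sim[rows, cols]

# Synthetic example: SAE B is a permuted, noisy copy of SAE A.
rng = np.random.default_rng(0)
w_a = rng.normal(size=(50, 16))
w_b = w_a[rng.permutation(50)] + 0.3 * rng.normal(size=(50, 16))

sim = cosine_matrix(w_a, w_b)
print("top-1 mean:    ", top1_similarity(sim).mean().round(3))
print("hungarian mean:", hungarian_similarity(sim).mean().round(3))
```

For equally sized dictionaries, each 1:1 matched similarity is at most the corresponding top-1 value, which is why the unconstrained top-1 numbers are an upper bound on what the matched comparison reports.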

Results

Across SAEs, many features had a significantly large top-1 cosine similarity (right). However, not every feature was universal: the top-n cosine similarity was also not very low, which indicates the existence of superpositioned features. When I trained on different datasets (1e8 tokens on TinyStories vs. 1e9 tokens on OpenWebText), the feature similarity was much lower than in the seed test.

Nevertheless, there were still features with high cosine similarity across SAEs, and dataset characteristics also affected the results. I didn't compute the exact ratio of features shared across two models, but the previous research indicated that 30% of features are shared (with a 0.7 cosine similarity threshold), and dataset differences might lower that percentage.
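The shared-feature ratio that was not computed here could be estimated along these lines. This is a sketch under the assumption that a feature counts as "shared" when its best cosine match in the other SAE reaches the threshold; this uses unconstrained top-1 matching rather than the 1:1 matching of the previous research, so the two ratios are not directly comparable. `shared_fraction` and the synthetic weights are hypothetical.

```python
import numpy as np

def shared_fraction(w_a: np.ndarray, w_b: np.ndarray, threshold: float = 0.7) -> float:
    """Fraction of A's features whose best cosine match in B is >= threshold.

    w_a: (n_a, d), w_b: (n_b, d), one nonzero feature direction per row (assumed).
    """
    a = w_a / np.linalg.norm(w_a, axis=1, keepdims=True)
    b = w_b / np.linalg.norm(w_b, axis=1, keepdims=True)
    best = (a @ b.T).max(axis=1)
    return float((best >= threshold).mean())

# Synthetic example: half of B's features are noisy copies of A's, half are unrelated.
rng = np.random.default_rng(0)
w_a = rng.normal(size=(40, 32))
w_b = np.vstack([
    w_a[:20] + 0.2 * rng.normal(size=(20, 32)),
    rng.normal(size=(20, 32)),
])

for t in (0.5, 0.7, 0.9):
    print(t, shared_fraction(w_a, w_b, threshold=t))
```

Sweeping the threshold instead of fixing it at 0.7 shows how sensitive the ratio is to that choice, which matters when comparing the seed and dataset settings.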

Conclusion and future works

The ratio of common features across SAEs is lower than I expected, but the experiments do show that such features exist. Furthermore, I agree with the statement: "We conjecture that there is some idealized set of features that dictionary learning would return if we provided it with an unlimited dictionary size." That size may be much larger than in this experiment, which uses around 2000 features, and I think that is why the results are affected by seeds. I suspect the underlying reason is that the SAEs find different subsets of the "idealized set of features" because of weight initialization and the small dictionary size. As a follow-up, examining how the common feature ratio changes as the dictionary size scales may be a great step forward.
