Final LessWrong first post

The decoder weights are directly shaped by the L2 reconstruction loss, which pushes the decoder to use the features as fully as possible. The encoder matrix, however, is pressured by the L1 sparsity loss on the feature vector, which keeps its weights from representing the features fully, so it “shrinks” the representational capacity, as above. The same explanation applies to the neuron similarity of each weight matrix. Because the encoder focuses more on neurons than on sparse features, it represents more about the neurons. Likewise, the decoder focuses more on the features and pays relatively less attention to the neurons themselves.

The low overall cosine similarity has two causes. First, the dimension of each neuron's weight vector is the dictionary size, so the curse of dimensionality makes it less likely that two vectors point in the same direction. Second, the many orphan features separated out by the SAEs cause mismatches between the vectors.
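The dimensionality point can be checked with a small sketch. This is only an illustration under stated assumptions, not the code used for the results here: the function names (cosine, mean_abs_cosine, neuron_cosines) are hypothetical, the weight layout (W_enc as d_model x d_dict, W_dec as d_dict x d_model) is an assumption, and independent Gaussian vectors stand in for real SAE weights. It shows how the typical cosine similarity between unrelated vectors changes as the dimension grows, and how a per-neuron encoder/decoder cosine could be computed.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_abs_cosine(dim, n_pairs, rng):
    """Average |cosine| over n_pairs of independent Gaussian vectors in R^dim."""
    total = 0.0
    for _ in range(n_pairs):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        total += abs(cosine(u, v))
    return total / n_pairs


def neuron_cosines(W_enc, W_dec):
    """Per-neuron cosine between encoder and decoder weights.

    Assumed layout (hypothetical): W_enc is d_model x d_dict and W_dec is
    d_dict x d_model, both given as lists of rows. Neuron j's vector is
    row j of W_enc and column j of W_dec, so each has length d_dict.
    """
    d_model = len(W_enc)
    return [cosine(W_enc[j], [row[j] for row in W_dec]) for j in range(d_model)]


# Random baseline: how similar are unrelated vectors at each dimension?
rng = random.Random(0)
for dim in (8, 64, 512, 2048):
    print(dim, round(mean_abs_cosine(dim, 100, rng), 4))
```

Comparing the measured per-neuron cosine similarities against this random baseline at the same dictionary size would indicate how much of the low similarity comes from dimensionality alone, as opposed to the loss asymmetry or orphan features.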

Universality Hypothesis (Chughtai et al., 2023; Bricken et al., 2023).


Recommendations