Neuronpedia single-token ratio

Paper using only the Neuronpedia API, built on the claim that single-token features are common in early and late layers and rare in middle layers
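A minimal sketch of the per-layer ratio, assuming each feature has already been labeled single-token or not by some separate criterion. The function name `single_token_ratio_by_layer` and the `(layer, is_single_token)` input format are hypothetical. Fetching the features from Neuronpedia is deliberately left out.

```python
from collections import defaultdict

def single_token_ratio_by_layer(features):
    """Return {layer: fraction of features labeled single-token}.

    `features` is an iterable of (layer, is_single_token) pairs.
    This input format is an assumption, not a Neuronpedia schema.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for layer, flag in features:
        totals[layer] += 1
        hits[layer] += int(bool(flag))
    # Sort by layer so early/middle/late trends are easy to read off.
    return {layer: hits[layer] / totals[layer] for layer in sorted(totals)}
```

Plotting the returned values against layer index would show whether the early/late-heavy, middle-light pattern holds for a given SAE set.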

DB access instead of the API

Short paper that tests the hypothesis

Alignment of the single-token feature subspace across different models/SAE sets, or the downstream effect (steering/transfer) of per-layer alignment/misalignment between feature groups

Here, define single-token-ness as "the extreme of concept monosemanticity" (or as one kind of token-level concept)

How to judge whether a feature is a single-token feature or not
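One possible criterion, sketched as an assumption rather than a settled definition: take the peak-activating token from each of a feature's top-activating examples and measure how often the most common token appears. The names `single_token_score` and `is_single_token`, the 0.9 threshold, and the whitespace/case normalization are all hypothetical choices to be tuned or replaced.

```python
from collections import Counter

def single_token_score(peak_tokens):
    """Fraction of top-activating examples sharing the most common peak token.

    `peak_tokens` holds one token string per top-activating example.
    Normalizing by strip() and lower() is an assumption: it treats
    " The" and "the" as the same token, which may or may not be wanted.
    """
    if not peak_tokens:
        return 0.0
    normalized = [tok.strip().lower() for tok in peak_tokens]
    _, count = Counter(normalized).most_common(1)[0]
    return count / len(normalized)

def is_single_token(peak_tokens, threshold=0.9):
    """Label a feature single-token if its score reaches the (assumed) threshold."""
    return single_token_score(peak_tokens) >= threshold
```

For example, `single_token_score([" the", "The", "the", " cat"])` gives 0.75, so that feature would not count as single-token at the 0.9 threshold. A stricter variant could also weight each example by its activation value.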

Use ChatGPT Deep Research for the literature review

Elhage et al., 2022 (Transformer Circuits, preprint)