SAEs offer a promising route to interpretability by learning sparse dictionaries that reconstruct intermediate activations and surface monosemantic features ‣, ‣.
Transcoders have recently been proposed to approximate dense MLP sublayers with wider, sparsely-activating replacements that better expose circuit structure ‣.
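For reference, the two objectives described above can be written out as follows. This is a sketch of the commonly used formulation, not necessarily the exact variant used in these experiments: the symbols ($W_{\mathrm{enc}}$, $W_{\mathrm{dec}}$, $b_{\mathrm{enc}}$, $b_{\mathrm{dec}}$, $\lambda$) and the choice of an L1 sparsity penalty are assumptions made for illustration.

```latex
\begin{aligned}
\text{SAE:}\quad
  z &= \mathrm{ReLU}(W_{\mathrm{enc}}\, x + b_{\mathrm{enc}}), &
  \hat{x} &= W_{\mathrm{dec}}\, z + b_{\mathrm{dec}}, &
  \mathcal{L}_{\mathrm{SAE}} &= \lVert x - \hat{x} \rVert_2^2 + \lambda \lVert z \rVert_1 \\
\text{Transcoder:}\quad
  z &= \mathrm{ReLU}(W_{\mathrm{enc}}\, x_{\mathrm{in}} + b_{\mathrm{enc}}), &
  \hat{y} &= W_{\mathrm{dec}}\, z + b_{\mathrm{dec}}, &
  \mathcal{L}_{\mathrm{TC}} &= \lVert \mathrm{MLP}(x_{\mathrm{in}}) - \hat{y} \rVert_2^2 + \lambda \lVert z \rVert_1
\end{aligned}
```

The only structural difference is the target: an SAE reconstructs its own input activation $x$, while a transcoder takes the MLP input $x_{\mathrm{in}}$ and is trained to predict the MLP output.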
Crosscoders, which are ~, have also been introduced to represent features shared across layers.
How can we justify the choice of layer and the number of preceding layers?
Should we run additional experiments to support this justification?
Dataset: DCLM
Models: GPT-OSS 20B, DeepSeekMoE, FlexOlmo
Explanation of why these models were selected
Experiments
So Far
Used 1% of DCLM (local-shard-01), containing approximately 100,000 rows
Filtered to rows with text length under 5,200, which is the mean text length of the shard
Trained the SAE on the first 30,000 rows of this filtered dataset
Used a basic SAE for testing with ‣
This is because transcoder_lens and sae_lens only support the models they have already integrated
They also do not support crosscoders or cross-layer SAEs
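The filtering and subsetting steps above can be sketched as a lazy pipeline. This is a minimal illustration, not the actual preprocessing script: the `text` field name, measuring length in characters, and the `fake_stream` stand-in are all assumptions. With Hugging Face `datasets`, a dataset loaded with `streaming=True` offers `.filter(...)` and `.take(n)` methods that play the same role as the two helpers here.

```python
from itertools import islice

MAX_TEXT_LEN = 5200   # mean text length of the shard, per the notes
NUM_ROWS = 30_000     # number of filtered rows used for SAE training


def filter_short_rows(rows, max_len=MAX_TEXT_LEN, text_key="text"):
    """Lazily yield rows whose text is shorter than max_len (characters assumed)."""
    for row in rows:
        if len(row[text_key]) < max_len:
            yield row


def first_n(rows, n=NUM_ROWS):
    """Lazily take the first n rows of an iterable without materializing it."""
    return islice(rows, n)


def fake_stream():
    """Hypothetical stand-in for a streamed shard: dicts with a 'text' field."""
    for i, length in enumerate([100, 6000, 5199, 5200, 42]):
        yield {"id": i, "text": "x" * length}


# Keep the first 2 rows that pass the length filter.
subset = list(first_n(filter_short_rows(fake_stream()), n=2))
kept_ids = [row["id"] for row in subset]
```

Because both steps are generators, nothing beyond the rows actually consumed is read from the stream, which matters when the full dataset does not fit on disk or in memory.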
Some Engineering
Because the dataset is too large, used the streaming mode of Hugging Face datasets and modified parts of the sparsify code to support it
The LLM and the SAE could not fit together in the 80 GB of memory on a single A100, so modified the code to initialize the LLM on cuda:0 and the SAE on cuda:1 and to transfer data between the two devices
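The two-device setup can be illustrated with a small PyTorch sketch. It is not the modified sparsify code: `llm_block`, `sae_encoder`, `sae_decoder`, and the layer sizes are hypothetical stand-ins, and the script falls back to CPU when two GPUs are not available so that it still runs.

```python
import torch
import torch.nn as nn


def pick_devices():
    """Return (model_device, sae_device); fall back to CPU without two GPUs."""
    if torch.cuda.is_available() and torch.cuda.device_count() >= 2:
        return torch.device("cuda:0"), torch.device("cuda:1")
    return torch.device("cpu"), torch.device("cpu")


model_device, sae_device = pick_devices()

# Hypothetical stand-ins: a linear layer plays the frozen LLM block,
# and an encoder/decoder pair plays the SAE.
d_model, d_sae = 16, 64
llm_block = nn.Linear(d_model, d_model).to(model_device)
sae_encoder = nn.Linear(d_model, d_sae).to(sae_device)
sae_decoder = nn.Linear(d_sae, d_model).to(sae_device)


def sae_step(hidden):
    """Run the LLM on its device, move activations over, and compute the SAE loss."""
    with torch.no_grad():  # the LLM is frozen; only the SAE is trained
        acts = llm_block(hidden.to(model_device))
    acts = acts.to(sae_device)  # the only cross-device transfer per step
    latents = torch.relu(sae_encoder(acts))
    recon = sae_decoder(latents)
    return torch.mean((recon - acts) ** 2)


loss = sae_step(torch.randn(8, d_model))
loss.backward()
```

Only the activation tensor crosses devices, so the per-step transfer cost scales with batch size times hidden size rather than with model size.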
Future Works
Cross-layer SAE
Cross-layer Transcoder
Add the DeepSeekMoE case
Results
So Far
Next Plan
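Cross-layer training on GPT-OSS and FlexOlmo with 10% of DCLM
Finalize the target layer and how many neighboring layers to include
Loss on wandb ⇒ BatchTopK (pass the setting values before training)
A clear description for each feature ⇒ obtain it with an LLM → map to the corresponding expert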
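Assuming "BatchTopK" in the plan above refers to the BatchTopK sparsity rule, the idea can be sketched in plain Python: instead of keeping the top k latents per sample, keep the k × batch_size largest positive pre-activations across the whole batch. The function name and the list-of-lists representation are illustrative choices, not the training code.

```python
def batch_topk(pre_acts, k):
    """Keep the k * batch_size largest positive values across the batch.

    pre_acts: list of rows, one row of latent pre-activations per sample.
    Returns a same-shaped list with all other entries set to 0.0.
    """
    batch_size = len(pre_acts)
    budget = k * batch_size
    # Flatten to (value, row, col), dropping non-positive entries (ReLU).
    flat = [
        (value, i, j)
        for i, row in enumerate(pre_acts)
        for j, value in enumerate(row)
        if value > 0
    ]
    flat.sort(key=lambda item: item[0], reverse=True)
    out = [[0.0] * len(row) for row in pre_acts]
    for value, i, j in flat[:budget]:
        out[i][j] = value
    return out


pre_acts = [[5.0, 1.0, 0.5], [0.2, 0.1, -3.0]]
sparse = batch_topk(pre_acts, k=1)
```

In this example the budget is 2 and both surviving values fall in the first sample, so the number of active latents per sample varies while the batch average stays at k.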