DeepSeeMoE

Key Questions

MoE models are not yet sufficiently interpretable: the relationship between dictionary features, experts, and the router is unclear

Comparison between separately trained experts and jointly trained ones

Objective: which performance are we trying to improve, and how should we choose the benchmark?

(MoE is hard to post-train because results are highly sensitive to which experts are chosen and how the router is tuned via LoRA. Features could serve as a hint.)

Background & Related Works

https://transformer-circuits.pub/2022/toy_model/index.html#demonstrating

Neurons in LLMs are polysemantic, which makes it difficult to understand how the network processes information and how to intervene on its representation features
However, polysemanticity has an advantage: fewer neurons can represent more concepts
As a result, features are no longer orthogonal and interfere with each other, but nonlinear functions appear to mitigate this problem
In particular, MoE is one of the mechanisms that makes the FFN sparse
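
To make the interference point concrete, here is a minimal NumPy sketch. It is not code from the linked post: the sizes, the pentagon layout, and the bias value are assumptions chosen for illustration. Five features share a 2-D hidden space, so their directions cannot be orthogonal; a linear readout leaks into the other features, while a ReLU with a small negative bias clips the leak when only one feature is active.

```python
import numpy as np

n_features, d_hidden = 5, 2  # hypothetical sizes: more features than dimensions

# Each feature gets a unit direction in the 2-D hidden space (a regular pentagon).
angles = 2 * np.pi * np.arange(n_features) / n_features
W = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape (5, 2)

# Off-diagonal entries of W @ W.T measure interference between features.
gram = W @ W.T
interference = gram - np.eye(n_features)

def readout(x, bias=0.0, nonlinear=False):
    """Compress x into the hidden space, then read every feature back out."""
    hidden = x @ W                 # (5,) -> (2,)
    out = hidden @ W.T + bias      # (2,) -> (5,)
    return np.maximum(out, 0.0) if nonlinear else out

x = np.eye(n_features)[0]                           # only feature 0 is active
linear_out = readout(x)                             # other features leak through
relu_out = readout(x, bias=-0.35, nonlinear=True)   # the leak is clipped to zero

print(np.round(linear_out, 3))
print(np.round(relu_out, 3))
```

The clipping only works this cleanly because the input is sparse; with several features active at once the interference terms add up and can exceed the bias.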

https://transformer-circuits.pub/2023/monosemantic-features/index.html

SAEs offer a promising route to interpretability by learning sparse dictionaries that reconstruct intermediate activations and surface monosemantic features
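
As a reference point, the sketch below shows one common SAE formulation (ReLU encoder, linear decoder, reconstruction loss plus an L1 sparsity penalty). It is a minimal illustration, not the training code used in this project: the sizes, the random weights, and the `l1_coeff` value are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_dict = 16, 64   # hypothetical sizes; the dictionary is wider than the activation

# Randomly initialised parameters stand in for trained ones.
W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(scale=0.1, size=(d_dict, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode activations into non-negative features, then reconstruct them."""
    features = np.maximum((x - b_dec) @ W_enc + b_enc, 0.0)
    recon = features @ W_dec + b_dec
    return features, recon

def sae_loss(x, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that pushes features toward sparsity."""
    features, recon = sae_forward(x)
    mse = np.mean(np.sum((x - recon) ** 2, axis=-1))
    l1 = np.mean(np.sum(np.abs(features), axis=-1))
    return mse + l1_coeff * l1

x = rng.normal(size=(8, d_model))   # a batch of 8 fake activation vectors
features, recon = sae_forward(x)
loss = sae_loss(x)
```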

Transcoders have recently been proposed to approximate dense MLP sublayers with wider, sparsely-activating replacements that better expose circuit structure

Crosscoders, which are ~, were also introduced to represent features shared across layers.

https://arxiv.org/pdf/1701.06538, https://arxiv.org/pdf/2401.06066, ‣
BTM, BTX, FlexOlmo

https://arxiv.org/abs/2208.03306, https://arxiv.org/abs/2403.07816, https://arxiv.org/pdf/2409.02060, ‣

Related Works

Methodology

SAE: Cross-layer

attn → router → expert(k) → sum → residual

└ Transcoder: from the attn output to the summed expert output

└ SAE: on the residual stream

SAE(residual stream) → can be interpreted as expert features
Transcoder(attn.output, [router] expert.sum.output) → router features
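
The data flow above can be sketched as a toy MoE sublayer that exposes the activations each dictionary would hook. This is only an illustration under stated assumptions: the experts are plain linear maps, the gate is a softmax over the selected experts, and all names (`moe_block`, `W_router`, `W_experts`) and sizes are hypothetical rather than taken from GPT-OSS, DeepSeekMoE, or FlexOlmo.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 4, 2   # hypothetical sizes

W_router = rng.normal(size=(d_model, n_experts))
W_experts = rng.normal(scale=0.1, size=(n_experts, d_model, d_model))  # one linear map per expert

def moe_block(resid_in, attn_out):
    """Toy MoE sublayer that returns every activation a dictionary could hook."""
    x = resid_in + attn_out                            # input seen by router and experts
    logits = x @ W_router                              # (tokens, n_experts)
    chosen = np.argsort(-logits, axis=-1)[:, :top_k]   # top-k expert indices per token
    picked = np.take_along_axis(logits, chosen, axis=-1)
    gates = np.exp(picked - picked.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)         # softmax over the chosen experts only
    expert_sum = np.zeros_like(x)
    for t in range(x.shape[0]):
        for slot in range(top_k):
            e = chosen[t, slot]
            expert_sum[t] += gates[t, slot] * (x[t] @ W_experts[e])
    resid_out = x + expert_sum
    return {
        "attn_out": attn_out,
        "chosen": chosen,
        "expert_sum": expert_sum,
        "resid_out": resid_out,
    }

tokens = 5
acts = moe_block(rng.normal(size=(tokens, d_model)), rng.normal(size=(tokens, d_model)))

# SAE: input and target are the same activation (reconstruct the residual stream).
sae_pair = (acts["resid_out"], acts["resid_out"])
# Transcoder: predict the summed expert output from the attention output.
transcoder_pair = (acts["attn_out"], acts["expert_sum"])
```

Keeping `chosen` next to the activations is what would later allow dictionary features to be compared against routing decisions.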
Question

How do we define, and show, that the cross-layer approach is better?
How do we justify the selected layer and the number of preceding layers?
Should we run additional experiments to support this justification?

Dataset: DCLM

Models: GPT-OSS 20B, DeepSeekMoE, FlexOlmo

Explanation of why these models were selected

Experiments

So Far

Used 1% of DCLM (local-shard-01), containing approximately 100,000 rows

Filtered to rows with text length under 5,200, which is the mean length of the shard

Trained the SAE on the first 30,000 rows of this filtered dataset
Used a basic SAE for testing with ‣

Reason: transcoder_lens and sae_lens only support models they have already integrated
They also do not support crosscoders or cross-layer SAEs
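
The filtering step above (keep short texts, then take the first 30,000 rows) can be sketched lazily so that it also works on a streamed dataset. This is a minimal sketch, not the project's code: it assumes the length is measured in characters and that each row is a dict with a `text` field, and `fake_shard` is a hypothetical stand-in so the example runs offline. With Hugging Face datasets, the same pattern would typically sit on top of `load_dataset(..., streaming=True)`.

```python
from itertools import islice

MAX_CHARS = 5200    # threshold from the notes; assumed here to count characters
N_ROWS = 30_000     # number of rows kept for SAE training

def short_texts(rows, max_len=MAX_CHARS):
    """Lazily yield rows whose text is shorter than max_len."""
    for row in rows:
        if len(row["text"]) < max_len:
            yield row

def first_n_short(rows, n=N_ROWS, max_len=MAX_CHARS):
    """Take the first n rows that pass the length filter without loading everything."""
    return list(islice(short_texts(rows, max_len), n))

# Hypothetical stand-in for a streamed shard: any iterable of {"text": ...} dicts works.
fake_shard = ({"text": "x" * length} for length in (100, 6000, 5199, 5200, 42))
kept = first_n_short(fake_shard, n=2)
```

Because both steps are generators, iteration stops as soon as `n` rows have been collected and the rest of the shard is never read.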

Some Engineering

Because the dataset is too large to load at once, used the streaming mode of Hugging Face datasets and patched parts of sparsify to support it
Because of the model sizes, the LLM and the SAE models could not fit together in the 80 GB memory of a single A100. Modified the code to initialize the LLM on cuda:0 and the SAE on cuda:1 and to transfer data between the two devices
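
The two-device placement can be sketched in PyTorch as follows. This is not the actual patch to sparsify: the tiny `nn.Linear` modules are hypothetical stand-ins for the LLM and the SAE, and the sketch falls back to CPU when two GPUs are not available so that it still runs.

```python
import torch
from torch import nn

# Fall back to CPU when two GPUs are not available, so the sketch still runs.
two_gpus = torch.cuda.is_available() and torch.cuda.device_count() >= 2
llm_device = torch.device("cuda:0" if two_gpus else "cpu")
sae_device = torch.device("cuda:1" if two_gpus else "cpu")

d_model, d_dict = 32, 128                                  # hypothetical sizes
llm_layer = nn.Linear(d_model, d_model).to(llm_device)     # stand-in for the frozen LLM
sae_encoder = nn.Linear(d_model, d_dict).to(sae_device)    # stand-in for the SAE encoder
sae_decoder = nn.Linear(d_dict, d_model).to(sae_device)    # stand-in for the SAE decoder

tokens = torch.randn(4, d_model, device=llm_device)

with torch.no_grad():                          # the LLM only supplies activations
    acts = llm_layer(tokens)

acts = acts.to(sae_device, non_blocking=True)  # hop from the LLM's device to the SAE's
recon = sae_decoder(torch.relu(sae_encoder(acts)))
loss = torch.mean((recon - acts) ** 2)
loss.backward()                                # gradients exist only for the SAE
```

Only the activations cross devices, so the transfer cost scales with batch size times hidden size rather than with model size.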

Future Works

Cross-layer SAE
Cross-layer Transcoder
Add the DeepSeekMoE case

Results

So Far

Next Plan

Cross-layer training on GPT-OSS and FlexOlmo with 10% of DCLM

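Fix the target layer and decide how many neighboring layers to include
Log the loss to wandb ⇒ BatchTopK (pass the setting values before training)
A clear description for each feature ⇒ obtain it with an LLM → map it to the corresponding expert
Some visualization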
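
For reference, the BatchTopK activation mentioned in the plan can be sketched as below. This follows the technique as it is commonly described (keep the `k * batch_size` largest activations across the whole batch instead of `k` per sample); it is not the implementation in any particular library, and treating `k` as one of the settings passed before training is an assumption.

```python
import numpy as np

def batch_topk(pre_acts, k):
    """Keep the k * batch_size largest activations across the whole batch; zero the rest."""
    acts = np.maximum(pre_acts, 0.0)            # features are non-negative
    n_keep = k * acts.shape[0]
    flat = acts.ravel()
    if n_keep >= flat.size:
        return acts
    keep_idx = np.argpartition(flat, -n_keep)[-n_keep:]  # indices of the n_keep largest entries
    out = np.zeros_like(flat)
    out[keep_idx] = flat[keep_idx]
    return out.reshape(acts.shape)

pre_acts = np.array([[5.0, 4.0, 3.0, 0.1],
                     [0.2, 0.3, 2.0, -1.0]])
sparse = batch_topk(pre_acts, k=2)
```

In this example the first row keeps three features and the second keeps one, so sparsity is `k` on average rather than exactly `k` for every sample.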

Why MoE is hard to post-train; LoRA

Future Works

Evaluation of SAE itself

Visualization

Application on post-training

Paper Draft

16_DeepSeeMoE.pdf
