
DeepSeeMoE

Key Questions

  1. MoE is not sufficiently interpretable in terms of the relationship between dictionary features, experts, and the router
  1. Comparison between separately trained experts and jointly trained ones
  1. Objective: which performance are we trying to improve, and how should the benchmark be chosen?
  1. (MoE is hard to post-train because results are very sensitive to which experts are chosen and how the router is tuned via LoRA. Features could serve as a hint.)

    Background & Related Works

    • SAE
      • https://transformer-circuits.pub/2022/toy_model/index.html#demonstrating
        • Neurons in LLMs are polysemantic, which makes it difficult to understand how the network processes information and how to intervene on the features it represents
        • However, polysemanticity has an advantage: fewer neurons can represent more concepts.
        • As a result, features are no longer orthogonal and interfere with each other, but this interference seems to be mitigated by nonlinear functions
        • In particular, MoE is one way of making the FFN sparse
      • https://transformer-circuits.pub/2023/monosemantic-features/index.html
        • SAEs offer a promising route to interpretability by learning sparse dictionaries that reconstruct intermediate activations and surface monosemantic features
      • Transcoders and crosscoders
        • Transcoders have recently been proposed to approximate dense MLP sublayers with wider, sparsely-activating replacements that better expose circuit structure
        • Crosscoders (which are ~) were also introduced to represent features shared across layers.
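
To make the SAE bullet above concrete, here is a minimal sketch of a sparse autoencoder forward pass and loss. It is not the setup used in these notes: the sizes, the ReLU encoder, the L1 coefficient, and the random untrained weights are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; d_dict > d_model gives an overcomplete dictionary.
d_model, d_dict = 16, 64

# Randomly initialised, untrained parameters (illustration only).
W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(scale=0.1, size=(d_dict, d_model))
b_dec = np.zeros(d_model)


def sae_forward(x):
    """Encode activations into non-negative features, then reconstruct them."""
    features = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)  # ReLU encoder
    recon = features @ W_dec + b_dec                         # linear decoder
    return features, recon


def sae_loss(x, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that pushes features toward sparsity."""
    features, recon = sae_forward(x)
    mse = np.mean(np.sum((x - recon) ** 2, axis=-1))
    l1 = np.mean(np.sum(np.abs(features), axis=-1))
    return mse + l1_coeff * l1


# Stand-in for a batch of intermediate activations from the LLM.
x = rng.normal(size=(8, d_model))
features, recon = sae_forward(x)
loss = sae_loss(x)
```

A transcoder would use the same encoder/decoder shape but be trained to predict a different tensor (the sublayer output) from the sublayer input, rather than to reconstruct its own input.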

    Methodology

    • SAE: Cross-layer [
      • attn → router → expert(k) → sum → residual
        • └ Transcoder ─┘
          • └ SAE
      • SAE(residual-stream) → can be interpreted as expert features
      • Transcoder(attn.output, [router] expert.sum.output) → router features ]
      • Question
        • How can we show that the cross-layer approach is better?
        • How can we justify the selected layer and the number of preceding layers?
        • Should we run additional experiments to support this justification?
    • Dataset: DCLM
    • Models: GPT-OSS 20B, DeepSeekMoE, FlexOlmo
      • Explanation of why these models were selected
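
The data flow in the Methodology diagram (attn → router → expert(k) → sum → residual) can be sketched as below, mainly to show which tensors the transcoder and the SAE would read. This is a simplified toy, not the architecture of any of the listed models: layer norms are omitted, each expert is a single linear map, and all names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 4, 2  # hypothetical sizes

W_router = rng.normal(size=(d_model, n_experts))
# One linear map per expert (real experts are gated MLPs).
W_experts = rng.normal(scale=0.1, size=(n_experts, d_model, d_model))


def moe_block(attn_out):
    """Route each token to top_k experts and return the tensors a dictionary would see."""
    logits = attn_out @ W_router                         # (tokens, n_experts)
    top_idx = np.argsort(-logits, axis=-1)[:, :top_k]    # chosen experts per token
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    gates = np.exp(top_logits - top_logits.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)           # softmax over the selected experts

    expert_sum = np.zeros_like(attn_out)
    for t in range(attn_out.shape[0]):
        for slot in range(top_k):
            e = top_idx[t, slot]
            expert_sum[t] += gates[t, slot] * (attn_out[t] @ W_experts[e])

    resid_out = attn_out + expert_sum                    # residual connection
    return {
        "transcoder_input": attn_out,     # attn.output
        "transcoder_target": expert_sum,  # expert.sum.output
        "sae_input": resid_out,           # residual stream
        "router_choice": top_idx,         # expert ids, for mapping features to experts
    }


tokens = rng.normal(size=(5, d_model))
captured = moe_block(tokens)
```

Keeping `router_choice` alongside the captured activations is what would later allow dictionary features to be compared against the experts each token was routed to.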

    Experiments

    • So Far
      • Used 1% of DCLM (local-shard-01), containing approximately 100,000 rows
        • Filtered to texts with length under 5200, which is the mean length in the shard
      • Trained the SAE on the first 30,000 rows of this dataset
      • Used a basic SAE for testing, because:
        • transcoder_lens and sae_lens only support the models they are already integrated with
        • They also do not support crosscoders or cross-layer SAEs
      • Some Engineering
        • Because the dataset is too large, used the streaming mode of Hugging Face datasets and modified parts of the sparsify code to support it
        • The LLM and the SAE models could not fit together in the 80GB memory of a single A100. Modified the code to initialize the LLM on cuda:0 and the SAE on cuda:1 and to transfer activations between them
    • Future Works
      • Cross-layer SAE
      • Cross-layer Transcoder
      • Add the DeepSeekMoE case
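
The length filter and row cap described under "So Far" can be applied lazily, so a large shard never has to be fully loaded. The sketch below assumes that "text length" means character count and that rows are dicts with a `text` field; `filtered_rows` and `fake_stream` are hypothetical helpers, with the fake stream standing in for a dataset opened with `datasets.load_dataset(..., streaming=True)`.

```python
from itertools import islice

MAX_CHARS = 5200   # threshold from the notes; assumed to count characters
MAX_ROWS = 30000   # number of rows kept for SAE training


def filtered_rows(rows, max_chars=MAX_CHARS, max_rows=MAX_ROWS, text_key="text"):
    """Lazily yield at most max_rows rows whose text is shorter than max_chars."""
    short = (row for row in rows if len(row[text_key]) < max_chars)
    return islice(short, max_rows)


def fake_stream():
    """Hypothetical stand-in for a streamed shard: texts of 0, 1000, ..., 9000 characters."""
    for i in range(10):
        yield {"text": "x" * (i * 1000)}


kept = list(filtered_rows(fake_stream(), max_rows=4))
all_short = list(filtered_rows(fake_stream()))
```

Because both steps are generators, iteration stops as soon as the row cap is reached, which matters when the source is a remote stream.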
     

    Results

    • So Far
     

    Next Plan

    • Cross-layer training on GPT-OSS and FlexOlmo with 10% of DCLM
      • Finalize the layer and how many neighboring layers to include
      • Loss logging in wandb ⇒ BatchTopK (pass the setting values before training)
      • Clear description for each feature ⇒ generate with an LLM → map to the corresponding expert
      • Some visualization
    • Why MoE is hard to post-train; LoRA
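
For reference on the BatchTopK item in the plan above, here is a minimal sketch of the activation function as it is usually described: instead of keeping the top k features per sample, it keeps the `batch_size * k` largest activations across the whole batch. The function name and the toy input are assumptions, and the input is assumed to be non-negative (for example, post-ReLU).

```python
import numpy as np


def batch_topk(acts, k):
    """Keep the (batch_size * k) largest activations across the whole batch; zero the rest."""
    batch_size = acts.shape[0]
    n_keep = batch_size * k
    flat = acts.ravel()
    keep_idx = np.argpartition(flat, -n_keep)[-n_keep:]  # indices of the n_keep largest values
    out = np.zeros_like(flat)
    out[keep_idx] = flat[keep_idx]
    return out.reshape(acts.shape)


acts = np.array([
    [0.9, 0.1, 0.8, 0.7],
    [0.2, 0.3, 0.0, 0.6],
])
sparse = batch_topk(acts, k=2)
active_per_row = np.count_nonzero(sparse, axis=1)
```

In this toy batch the first row keeps three features and the second keeps one: the average is still k = 2 per sample, but the budget is shared across the batch.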
     

    Future Works

    • Evaluation of SAE itself
    • Visualization
    • Application on post-training
     

    Paper Draft

     

    Recommendations