MoE Routing

Creator: Seonglae Cho
Created: 2024 Jan 16 4:18
Edited: 2025 Nov 14 0:43

AI Load Balancing

MoE routing is usually applied per MLP layer (attention weights are shared across experts). The router assigns each expert an
Affinity Score
, selects the top-k experts, and combines their outputs in a weighted sum using those scores. In practice, the router is a single small linear layer per MoE layer: a softmax over its logits yields the scores from which the top-k experts are chosen.
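A minimal NumPy sketch of this top-k routing step (all names, shapes, and the toy linear "experts" are illustrative, not from any particular implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def top_k_route(x, router_w, experts, k=2):
    """Route one token: softmax over router logits gives affinity scores,
    the top-k experts are selected, and their outputs are combined in a
    weighted sum using the (renormalized) scores."""
    logits = x @ router_w                       # (num_experts,)
    scores = np.exp(logits - logits.max())
    scores /= scores.sum()                      # softmax affinity scores
    top = np.argsort(scores)[-k:]               # indices of the top-k experts
    weights = scores[top] / scores[top].sum()   # renormalize over top-k
    return sum(w * experts[i](x) for w, i in zip(weights, top))

d, num_experts = 8, 4
router_w = rng.normal(size=(d, num_experts))    # the small single-layer router
# each "expert" stands in for a feedforward layer (here just one linear map)
expert_ws = [rng.normal(size=(d, d)) for _ in range(num_experts)]
experts = [lambda x, w=w: x @ w for w in expert_ws]

x = rng.normal(size=d)
y = top_k_route(x, router_w, experts, k=2)
print(y.shape)  # same shape as the input token: (8,)
```

Only k of the num_experts feedforward maps are evaluated per token, which is the source of MoE's compute savings at scale.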

DeepSeek

Load Balancing loss (Distributed ML)

Global-batch load balance almost free lunch to improve your MoE LLM training
GITHUB HUGGING FACE MODELSCOPE DISCORD Background The Mixture-of-Experts (MoEs) architecture has become a popular model-parameter-scale-up technique. Typically, one MoE layer consists of a router (often parameterized as one single Linear layer) and a group of experts (for transformer-based models, each expert is one feedforward layer). Given an input, only a subset of experts will be activated, and then their outputs will be aggregated based on the scores the router assigned.
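As a rough sketch of such an auxiliary load-balancing loss, here is the common Switch-Transformer-style form N · Σᵢ fᵢ·Pᵢ (the exact DeepSeek / global-batch formulation may differ; the "global-batch" idea is to compute fᵢ and Pᵢ over the whole global batch rather than per micro-batch):

```python
import numpy as np

def load_balance_loss(router_probs, expert_assign, num_experts):
    """Auxiliary load-balancing loss: num_experts * sum_i f_i * P_i,
    where f_i is the fraction of tokens whose top-1 expert is i and
    P_i is the mean router probability assigned to expert i.
    Aggregating f_i and P_i across the global batch (instead of each
    micro-batch) gives the global-batch variant described above."""
    f = np.bincount(expert_assign, minlength=num_experts) / len(expert_assign)
    P = router_probs.mean(axis=0)               # (num_experts,)
    return num_experts * float(np.sum(f * P))

# toy batch: 6 tokens, 3 experts, router always favoring expert 0
probs = np.array([[0.7, 0.2, 0.1]] * 6)         # router probabilities per token
assign = probs.argmax(axis=1)                   # top-1 expert per token
loss = load_balance_loss(probs, assign, 3)
print(loss)
```

The loss is minimized (value 1.0) under a uniform assignment, so adding it to the training objective pushes the router toward balanced expert usage.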
Mixture-of-Experts with Expert Choice Routing
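Expert-choice routing inverts the selection: each expert picks its top-c tokens by affinity score, rather than each token picking its top-k experts, so per-expert load is fixed by construction. A minimal sketch of that idea (shapes and the capacity value are illustrative, not from the paper):

```python
import numpy as np

def expert_choice_route(scores, capacity):
    """Expert-choice routing sketch: each expert selects its `capacity`
    highest-scoring tokens, guaranteeing perfectly balanced expert load
    (a token may be picked by several experts, or by none)."""
    num_tokens, num_experts = scores.shape
    assignment = {}  # expert index -> list of (token_index, score)
    for e in range(num_experts):
        top = np.argsort(scores[:, e])[-capacity:]   # this expert's tokens
        assignment[e] = [(int(t), float(scores[t, e])) for t in top]
    return assignment

rng = np.random.default_rng(1)
scores = rng.random((8, 3))                 # 8 tokens, 3 experts
routes = expert_choice_route(scores, capacity=2)
for e, toks in routes.items():
    print(e, toks)                          # every expert gets exactly 2 tokens
```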

Octopus

NexaAIDev/Octopus-v4 · Hugging Face
