AI Load Balancing

Usually MoE routing replaces the MLP (feedforward) sub-layer and is done per layer and per token (the attention weights are shared). Routing is based on an affinity score for each expert: the top-k experts are selected, and their outputs are combined as a weighted sum using those scores. A minimal code sketch of this routing, and of the auxiliary load-balance loss, appears at the end of this note.

Topics:
MoE Routing Notion
MoE Gating network
Sparse Gated MoE
Auxiliary-Loss-Free Load Balancing
Auxiliary Loss for Load Balance
Affinity Score
MoE Node-Limited Routing Load Balancing loss (Parallel Training)

Global-batch load balance: almost free lunch to improve your MoE LLM training
Background: The Mixture-of-Experts (MoE) architecture has become a popular model-parameter-scale-up technique. Typically, one MoE layer consists of a router (often parameterized as a single Linear layer) and a group of experts (for transformer-based models, each expert is one feedforward layer). Given an input, only a subset of experts is activated, and their outputs are then aggregated based on the scores the router assigned.
https://qwenlm.github.io/blog/global-load-balance/

Mixture-of-Experts with Expert Choice Routing
https://ai.googleblog.com/2022/11/mixture-of-experts-with-expert-choice.html

Octopus
NexaAIDev/Octopus-v4 · Hugging Face
https://huggingface.co/NexaAIDev/Octopus-v4

DeepSeek
DeepSeek-V2 (deepseek-ai)
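
A minimal sketch of the top-k routing described at the top of this note, assuming PyTorch: a single Linear router produces an affinity score per expert, the top-k experts are selected per token, and their feedforward outputs are combined as a score-weighted sum. The hidden sizes, expert count, and k below are illustrative placeholders, not values from any of the linked models.

```python
# Sketch of one top-k routed MoE layer (assumed hyperparameters, PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Router: a single Linear layer producing one affinity score per expert.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # Experts: each one is an ordinary feedforward block.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):  # x: (num_tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)          # affinity scores
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        # Renormalize the selected scores so they sum to 1 per token.
        topk_scores = topk_scores / topk_scores.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        # Weighted sum of the selected experts' outputs.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += topk_scores[mask, slot:slot + 1] * expert(x[mask])
        return out

# Example: 16 token embeddings of width 512 go in, same shape comes out.
layer = TopKMoELayer()
print(layer(torch.randn(16, 512)).shape)  # torch.Size([16, 512])
```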
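
For the "Auxiliary Loss for Load Balance" topic above, a hedged sketch of the common Switch-Transformer-style formulation: the sum over experts of the fraction of dispatched tokens times the mean router probability, scaled by the number of experts. The coefficient alpha and the top-k normalization are assumptions for illustration, not taken from any of the linked models; inside the layer above, `router_probs` and `expert_idx` would correspond to `scores` and `topk_idx`.

```python
# Sketch of an auxiliary load-balancing loss (Switch Transformer-style form).
import torch

def load_balance_aux_loss(router_probs, expert_idx, alpha=0.01):
    """
    router_probs: (num_tokens, num_experts) softmax affinity scores from the router.
    expert_idx:   (num_tokens, top_k) indices of the experts each token was dispatched to.
    """
    num_tokens, num_experts = router_probs.shape
    top_k = expert_idx.shape[1]
    # f_i: fraction of token-to-expert assignments that went to expert i.
    one_hot = torch.zeros_like(router_probs).scatter_(1, expert_idx, 1.0)
    f = one_hot.sum(dim=0) / (num_tokens * top_k)
    # P_i: mean router probability assigned to expert i.
    p = router_probs.mean(dim=0)
    # Both f and P are uniform (1 / num_experts) when load is perfectly balanced,
    # so this scaled dot product is minimized by a balanced assignment.
    return alpha * num_experts * torch.sum(f * p)
```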