AI Load Balancing
Usually MoE routing is applied per layer to the MLP/feedforward sub-layer (the attention weights are shared rather than routed). The router assigns each token an affinity score per expert, the top-k experts are selected, and their outputs are combined as a weighted sum based on those scores.
Typically the router is a small single linear layer per MoE layer; softmax is applied to its logits and the top-k experts are selected, as in the sketch below.
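A minimal sketch of this per-layer top-k routing, assuming a PyTorch setting with a single linear router and a list of feedforward experts (all names and shapes are illustrative, not any specific model's implementation):

```python
import torch
import torch.nn.functional as F

def moe_forward(x, router_weight, experts, k=2):
    """x: (n_tokens, d_model), router_weight: (n_experts, d_model) single linear router,
    experts: list of feedforward modules, k: experts activated per token."""
    logits = x @ router_weight.t()                  # affinity scores, (n_tokens, n_experts)
    probs = F.softmax(logits, dim=-1)
    topk_probs, topk_idx = probs.topk(k, dim=-1)    # select top-k experts per token
    out = torch.zeros_like(x)
    for slot in range(k):
        idx = topk_idx[:, slot]
        gate = topk_probs[:, slot].unsqueeze(-1)
        for e, expert in enumerate(experts):
            mask = idx == e
            if mask.any():
                # weighted sum of the selected experts' outputs
                out[mask] += gate[mask] * expert(x[mask])
    return out
```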
MoE Routing Notion
DeepSeek
Load Balancing loss (Distributed ML)
Global-batch load balance: almost free lunch to improve your MoE LLM training
Background: The Mixture-of-Experts (MoE) architecture has become a popular model-parameter-scale-up technique. Typically, one MoE layer consists of a router (often parameterized as a single linear layer) and a group of experts (for transformer-based models, each expert is one feedforward layer). Given an input, only a subset of experts is activated, and their outputs are aggregated based on the scores the router assigns.
https://qwenlm.github.io/blog/global-load-balance/
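A hedged sketch of the idea, assuming a Switch/DeepSeek-style auxiliary load-balancing loss in which the expert-usage statistics are averaged over the whole global batch (via an all-reduce across micro-batches/devices) instead of per micro-batch; function and variable names are illustrative, and the all-reduce assumes a backend supporting `ReduceOp.AVG` (e.g. NCCL):

```python
import torch
import torch.distributed as dist

def load_balance_loss(router_probs, topk_idx, n_experts, global_batch=True):
    """router_probs: (n_tokens, n_experts) softmax scores, topk_idx: (n_tokens, k) selected experts."""
    # f_i: fraction of tokens routed to expert i; p_i: mean router probability for expert i
    one_hot = torch.zeros_like(router_probs).scatter_(1, topk_idx, 1.0)
    f = one_hot.mean(dim=0)
    p = router_probs.mean(dim=0)
    if global_batch and dist.is_available() and dist.is_initialized():
        # average the statistics over all micro-batches so the balance
        # constraint applies to the global batch, not each local micro-batch
        dist.all_reduce(f, op=dist.ReduceOp.AVG)
        dist.all_reduce(p, op=dist.ReduceOp.AVG)
    return n_experts * torch.sum(f * p)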
Mixture-of-Experts with Expert Choice Routing
https://ai.googleblog.com/2022/11/mixture-of-experts-with-expert-choice.html
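In expert-choice routing the selection is flipped: instead of each token picking its top-k experts, each expert picks its top-c tokens, so every expert processes the same number of tokens and load is balanced by construction. A rough sketch (the score matrix follows the paper's setup, but the exact normalization and names here are assumptions):

```python
import torch
import torch.nn.functional as F

def expert_choice_route(x, router_weight, capacity):
    """x: (n_tokens, d_model), router_weight: (n_experts, d_model), capacity: tokens per expert."""
    scores = F.softmax(x @ router_weight.t(), dim=-1)  # token-to-expert affinity, (n_tokens, n_experts)
    # each expert (column) selects its own top-`capacity` tokens
    gate, token_idx = scores.topk(capacity, dim=0)     # both (capacity, n_experts)
    return gate, token_idx
```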

Octopus
NexaAIDev/Octopus-v4 · Hugging Face
https://huggingface.co/NexaAIDev/Octopus-v4

Seonglae Cho