MoE

Creator: Seonglae Cho
Created: 2023 Apr 12 14:20
Edited: 2025 Feb 6 11:36

Mixture-of-Experts

Mixture-of-Experts models improve efficiency by activating only a small subset of model weights for a given input, decoupling model size from inference cost.
MoEs have seen great success in LLMs. In a nutshell, MoEs pre-train faster and offer faster inference, but they require more memory and face challenges in fine-tuning.
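As a rough illustration of that per-token sparsity, below is a minimal sketch of a top-k routed MoE layer in PyTorch. The class name, dimensions, and router design are illustrative assumptions for this note, not the implementation of any particular model.

```python
# Minimal top-k MoE layer sketch (illustrative; names and sizes are assumptions)
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # gating network scores each expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                              # x: (tokens, d_model)
        logits = self.router(x)                        # (tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1) # keep only k experts per token
        weights = F.softmax(weights, dim=-1)           # renormalize the selected scores
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)  # tokens routed to expert e
            if token_ids.numel() == 0:
                continue
            out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out
```

Only `top_k` of the `n_experts` expert MLPs run for each token, so per-token compute stays roughly constant while total parameter count grows with the number of experts.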

Transformer MoE

Usually only the MLP layer is split into experts and routed per token. MoE keeps
Weight Sharing
for the attention layer because, besides being harder to train, attention is an inter-token operation that changes the token residual embeddings, so leaving it unshared is problematic. The MLP, by contrast, is a per-token operation that acts as a key-value store of diverse memories, which makes it well suited to MoE.
Cross-Attention
between experts is not used because the input residual and the attention computation are tightly coupled. Still, information exchange between experts, like what the
Corpus callosum
does between the
Left Cerebral hemisphere
and the
Right Cerebral hemisphere
, is an appealing design insight.
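To make the split concrete, here is a hedged sketch of a Transformer block that keeps a single shared attention module for inter-token mixing and routes only the position-wise MLP through experts. It reuses the illustrative `MoELayer` from above; layer names and defaults are assumptions, not a reference architecture.

```python
# Transformer block sketch: shared attention, expert-routed MLP (assumes MoELayer above)
class MoETransformerBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # weights shared for all tokens
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.moe = MoELayer(d_model)  # per-token routing, no inter-token mixing here

    def forward(self, x):             # x: (batch, seq, d_model)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out              # inter-token mixing happens only in shared attention
        b, s, d = x.shape
        x = x + self.moe(self.norm2(x).reshape(b * s, d)).reshape(b, s, d)  # token-wise expert MLP
        return x
```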
MoE Notion

Structure

1991
Scaling for chips

Recommendations