Mixture-of-Experts
Mixture-of-Experts (MoE) models improve efficiency by activating only a small subset of the model's weights for a given input, decoupling model size from inference cost.
MoEs have seen great success in LLMs. In a nutshell, MoEs pre-train faster and offer faster inference, but they require more memory and are harder to fine-tune.
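To make the sparse-activation idea concrete, here is a minimal PyTorch sketch of a top-k routed MoE layer. The class name `TopKMoE`, the expert count, and the dimensions are illustrative assumptions, not any specific model's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal sketch of sparse activation: each token runs through only its top-k experts."""

    def __init__(self, d_model=512, d_hidden=2048, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)  # gating network produces expert scores
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                      # x: (num_tokens, d_model)
        logits = self.router(x)                # (num_tokens, num_experts)
        weights, idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalize over the k chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):             # only k of num_experts run per token
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e       # tokens whose slot-th choice is expert e
                if mask.any():
                    w = weights[mask, slot].unsqueeze(-1)
                    out[mask] += w * self.experts[e](x[mask])
        return out

# Only 2 of the 8 expert MLPs are evaluated for each of the 16 tokens.
tokens = torch.randn(16, 512)
mixed = TopKMoE()(tokens)
```

Although the full parameter count grows with the number of experts, the compute per token depends only on k, which is what decouples model size from inference cost.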
Transformer MoE
Usually only the MLP layers are split into experts, with tokens routed at each MoE layer. The attention layers keep shared weights because, beyond being harder to train otherwise, attention is an inter-token operation in which the token residual embeddings interact, which becomes problematic if the weights are not shared. The MLP, by contrast, operates on each token independently and acts as a key-value storage holding diverse memories, making it a natural fit for MoE.
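A structural sketch of this layout, assuming the `TopKMoE` layer from the sketch above is in scope: one shared attention module mixes information across tokens, while only the per-token MLP is split into routed experts. Module names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class MoETransformerBlock(nn.Module):
    """Sketch: attention weights are shared across all tokens;
    only the position-wise MLP is split into routed experts."""

    def __init__(self, d_model=512, n_heads=8, num_experts=8, k=2):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        # TopKMoE is the routed-expert MLP from the sketch above
        self.moe_ffn = TopKMoE(d_model=d_model, num_experts=num_experts, k=k)

    def forward(self, x):                     # x: (batch, seq, d_model)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)      # inter-token mixing through one shared module
        x = x + attn_out
        b, s, d = x.shape
        # per-token routing: flatten tokens and route each one independently
        ffn_out = self.moe_ffn(self.norm2(x).reshape(b * s, d)).reshape(b, s, d)
        return x + ffn_out
```

With this layout every token passes through the same attention weights, but different tokens can hit different expert MLPs within the same layer.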
The reason cross-attention does not occur between experts is that the input residual and attention are coupled together. Still, much as the corpus callosum exchanges information between the left and right cerebral hemispheres, cross-attention-style information exchange between experts is a good design insight.
MoE Notion
Structure
1991
Scaling for chips