gradient starvation

Creator: Seonglae Cho
Created: 2025 Nov 27 23:21
Editor: Seonglae Cho
Edited: 2025 Nov 27 23:23
Refs

The biggest problem with Sparse MoE is gradient starvation: a Top-K router activates only a few experts per token, so experts that are not selected receive zero gradient for that token, and the router itself only gets feedback through the experts it picked. Rarely chosen experts are therefore updated less, become even less likely to be chosen, and can stall entirely.
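A minimal sketch of this failure mode, assuming a hypothetical Top-K MoE layer in PyTorch (TopKMoE below is illustrative, not the paper's implementation): after one backward pass, any expert that received no tokens in the batch has no gradient at all.

```python
# Hypothetical Top-K MoE layer: experts that receive no tokens get no gradient.
import torch
import torch.nn as nn


class TopKMoE(nn.Module):
    def __init__(self, d_model=8, n_experts=4, k=1):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_experts)])
        self.k = k

    def forward(self, x):                                     # x: (tokens, d_model)
        gates = self.router(x).softmax(dim=-1)                # (tokens, n_experts)
        weights, idx = gates.topk(self.k, dim=-1)             # top-k gate values / expert ids
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = idx == e                                    # (tokens, k) routing mask
            tok = sel.any(dim=-1).nonzero(as_tuple=True)[0]   # tokens routed to expert e
            if tok.numel():                                   # unselected experts are skipped,
                gate = (weights * sel).sum(dim=-1)[tok, None] # so they never enter the graph
                out = out.index_add(0, tok, gate * expert(x[tok]))
        return out


torch.manual_seed(0)
moe = TopKMoE()
loss = moe(torch.randn(6, 8)).sum()
loss.backward()
for e, expert in enumerate(moe.experts):
    g = expert.weight.grad
    print(f"expert {e}:", "starved (grad is None)" if g is None else f"grad norm {g.norm().item():.3f}")
```

With only a handful of tokens and k=1, some experts typically get nothing and their .grad stays None; over many steps this compounds, since a starved expert never improves enough for the router to start choosing it. The paper linked below tackles this by making backpropagation through the router dense.
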
Dense Backpropagation Improves Training for Sparse Mixture-of-Experts
https://arxiv.org/abs/2504.12463v3
"Mixture of Experts (MoE) pretraining is more scalable than dense Transformer pretraining, because MoEs learn to route inputs to a sparse set of their feedforward parameters. However, this means..."
 
 


Copyright Seonglae Cho