Training Experts to Coordinate
Training anchors on a single expert that all the other experts share. In a typical mixture-of-experts model, the router applies a softmax over a learned weight matrix W and is trained end-to-end alongside all expert modules. In this approach, the authors instead decompose W into expert-specific router embeddings, computed with a dedicated embedder. As a result, removing a specific expert affects only the evaluation metrics tied to that expert and leaves the others unaffected.
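To make the decomposition concrete, here is a minimal sketch, assuming the router scores each expert by a dot product between the input and that expert's own embedding, followed by a softmax. The expert names, the embedding values, and the `route` helper are all hypothetical and chosen for illustration; the sketch does not reproduce the actual training procedure or the embedder.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(x, router_embeddings):
    """Return {expert_name: routing probability} for input vector x.

    router_embeddings maps each expert name to its own row of W, so the
    router matrix is simply the stack of whichever rows are present.
    """
    names = list(router_embeddings)
    scores = [
        sum(w_i * x_i for w_i, x_i in zip(router_embeddings[name], x))
        for name in names
    ]
    return dict(zip(names, softmax(scores)))

# Hypothetical per-expert router embeddings (one row of W each).
router_embeddings = {
    "shared": [1.0, 0.0],
    "code": [0.0, 1.0],
    "math": [0.5, 0.5],
}
x = [2.0, 1.0]

full = route(x, router_embeddings)

# Dropping an expert just deletes its row; no other row changes.
remaining = {k: v for k, v in router_embeddings.items() if k != "math"}
without_math = route(x, remaining)
```

Because each expert owns one row, deleting "math" leaves the raw scores of the remaining experts untouched. Their probabilities are renormalized, but their relative weighting is the same before and after removal, which is the property that makes adding or removing experts cheap in this sketch.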
FlexOlmo
allenai • Updated 2025 Jul 28 8:18