Mixture-of-Experts (MoE) LLMs suffer from severely sub-optimal expert routing learned during pretraining: there is a 10–20% accuracy gap between the base model's routing and the optimal pathway. Existing test-time adaptation methods, such as in-context learning, prompt tuning, and prefix tuning, do not directly optimize the MoE pathway structure. This motivates an approach that dynamically re-mixes expert routing weights on a per-sample basis at test time.
C3PO finds an (approximately) optimal expert pathway for a test sample by leveraging pathway matrices from similar reference samples whose pathways were successful. It proposes three surrogate objectives: (1) Neighborhood Gradient Descent (NGD), which applies gradient descent to a kernel-weighted average of the neighbors' losses; (2) Kernel Regression, which computes a kernel-weighted average of the neighbors' pathways and interpolates it with the original pathway; and (3) Mode Finding (mean-shift), which iteratively moves the pathway toward the densest region in pathway space. Neighbors are defined via k-NN or an ε-ball, and a Gaussian kernel works best. NGD performs best among the three and reaches 85–95% of oracle performance without ground-truth labels for the test sample.
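A minimal sketch of the NGD objective, assuming the Gaussian kernel is computed over sample embeddings and that `loss_fn` evaluates a labeled reference sample under a candidate pathway; all names and shapes here are illustrative, not the authors' implementation:

```python
import torch

def gaussian_kernel(x, xi, bandwidth=1.0):
    # Similarity between the test sample's embedding and a reference embedding.
    return torch.exp(-((x - xi) ** 2).sum() / (2 * bandwidth ** 2))

def ngd(pathway, x, neighbors, loss_fn, lr=0.1, steps=10):
    """Neighborhood Gradient Descent: descend the kernel-weighted
    average of the reference neighbors' losses w.r.t. the routing weights."""
    pathway = pathway.clone().requires_grad_(True)
    # Kernel weights over the reference neighbors, normalized to sum to 1.
    weights = torch.stack([gaussian_kernel(x, xi) for xi, _ in neighbors])
    weights = weights / weights.sum()
    for _ in range(steps):
        surrogate = sum(w * loss_fn(pathway, xi, yi)
                        for w, (xi, yi) in zip(weights, neighbors))
        (grad,) = torch.autograd.grad(surrogate, pathway)
        with torch.no_grad():
            pathway -= lr * grad  # gradient step on the routing weights only
    return pathway.detach()
```

Note that the gradient flows only into the pathway (routing weights); model parameters stay frozen, which is what makes this a test-time adaptation rather than fine-tuning.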
To reduce compute, C3PO optimizes only selected critical layers and core experts rather than all layers and experts; the masking idea is sketched below. Empirically, optimizing only the last 5 layers can outperform optimizing all 16 layers, and optimizing only the top-20 experts (out of 64) can match full-expert optimization. Optimizing only the last token's routing weights is most effective, and about 10 NGD steps suffice. A practical limitation is the dependence on the reference set: C3PO needs reference samples with ground-truth labels (to know which pathways were successful), and suitable ones may be hard to obtain.
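The layer/expert/token restriction can be pictured as a boolean mask over the routing tensor that gates which entries receive gradient updates. A toy sketch, where the tensor shape and the selection of core experts by initial routing mass are my assumptions for illustration:

```python
import torch

# Toy routing tensor: (num_layers, num_tokens, num_experts).
num_layers, num_tokens, num_experts = 16, 32, 64
pathway = torch.rand(num_layers, num_tokens, num_experts, requires_grad=True)

# Restrict updates to the last 5 layers, the last token, and the 20 experts
# with the largest initial routing mass (a stand-in for "core experts").
core = pathway.detach().sum(dim=(0, 1)).topk(20).indices
mask = torch.zeros_like(pathway, dtype=torch.bool)
mask[-5:, -1, core] = True

toy_loss = (pathway ** 2).sum()          # stand-in for the NGD surrogate loss
(grad,) = torch.autograd.grad(toy_loss, pathway)
with torch.no_grad():
    pathway -= 0.1 * grad * mask         # only masked entries are updated
```

Masking shrinks the effective optimization problem from all 16 x 32 x 64 routing weights to 5 x 1 x 20, which is what makes roughly 10 NGD steps cheap enough for per-sample use.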
C3PO: Critical-Layer, Core-Expert, Collaborative Pathway...
Mixture-of-Experts (MoE) Large Language Models (LLMs) suffer from severely sub-optimal expert pathways; our study reveals that naive expert selection learned from pretraining leaves a surprising...
https://arxiv.org/abs/2504.07964


Seonglae Cho