MatFormer introduces a matryoshka-style nested structure into the Transformer's Feed-Forward Network (FFN), packing multiple sub-models of different sizes inside a single large model. During training, it jointly optimizes several FFN widths, e.g. sampling a different width (0.5×, 1×, 2×, or 4×) at each step, so that all granularities are trained with shared weights. With Mix'n'Match, a different width can be chosen per layer, allowing hundreds of new sub-models to be extracted from one trained network.
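The nesting idea can be sketched in a few lines: each layer keeps one full-size set of FFN weights, and a smaller sub-model simply uses the leading slice of the hidden dimension. The sizes, layer count, and `forward` helper below are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

D_MODEL = 8     # model (embedding) width, chosen small for illustration
FFN_FULL = 32   # full FFN hidden size (the largest granularity)

def make_layer():
    # One full-size weight pair per layer; every sub-model shares these.
    return {
        "W_in":  rng.normal(size=(D_MODEL, FFN_FULL)) / np.sqrt(D_MODEL),
        "W_out": rng.normal(size=(FFN_FULL, D_MODEL)) / np.sqrt(FFN_FULL),
    }

def ffn(x, layer, hidden):
    """FFN forward using only the first `hidden` units of the shared weights
    (the matryoshka nesting: small models are prefixes of the large one)."""
    h = np.maximum(x @ layer["W_in"][:, :hidden], 0.0)  # ReLU
    return h @ layer["W_out"][:hidden, :]

layers = [make_layer() for _ in range(4)]

def forward(x, widths):
    # Mix'n'Match: a (possibly different) FFN width per layer.
    for layer, w in zip(layers, widths):
        x = x + ffn(x, layer, w)  # residual connection
    return x

x = rng.normal(size=(1, D_MODEL))
out_small = forward(x, [8, 8, 8, 8])      # uniform small sub-model
out_mixed = forward(x, [8, 16, 32, 16])   # one Mix'n'Match configuration
out_full  = forward(x, [32, 32, 32, 32])  # full model
```

Because the width list is per-layer, the number of extractable configurations grows combinatorially (here 4 widths over 4 layers already gives 256 variants), which is where the "hundreds of sub-models" come from.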
This enables elastic inference: the model size can be selected on the fly to meet latency and cost constraints. Because every sub-model shares the same weights, small and large models produce consistent predictions, which makes the small model an effective drafter and speeds up speculative decoding. No separate compression step or teacher model is needed: a single training run provides "multiple optimal models from a single training session", supporting flexible deployment from mobile devices to large clusters.
2023