MatFormer introduces a matryoshka-style nested structure into the Transformer's Feed-Forward Network (FFN), packing multiple sub-models of different sizes inside a single large model. During training, it jointly optimizes several FFN widths, e.g. sampling a different width (0.5×, 1×, 2×, or 4×) at each step, so that all granularities are trained with shared weights. With Mix'n'Match, a different width can be chosen per layer, allowing hundreds of new sub-models to be extracted from one trained network.
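The nesting idea can be sketched in a few lines: each layer keeps one full-size set of FFN weights, and a smaller sub-model simply uses the leading slice of the hidden dimension. The sizes, layer count, and `forward` helper below are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

D_MODEL = 8     # model (embedding) width, chosen small for illustration
FFN_FULL = 32   # full FFN hidden size (the largest granularity)

def make_layer():
    # One full-size weight pair per layer; every sub-model shares these.
    return {
        "W_in":  rng.normal(size=(D_MODEL, FFN_FULL)) / np.sqrt(D_MODEL),
        "W_out": rng.normal(size=(FFN_FULL, D_MODEL)) / np.sqrt(FFN_FULL),
    }

def ffn(x, layer, hidden):
    """FFN forward using only the first `hidden` units of the shared weights
    (the matryoshka nesting: small models are prefixes of the large one)."""
    h = np.maximum(x @ layer["W_in"][:, :hidden], 0.0)  # ReLU
    return h @ layer["W_out"][:hidden, :]

layers = [make_layer() for _ in range(4)]

def forward(x, widths):
    # Mix'n'Match: a (possibly different) FFN width per layer.
    for layer, w in zip(layers, widths):
        x = x + ffn(x, layer, w)  # residual connection
    return x

x = rng.normal(size=(1, D_MODEL))
out_small = forward(x, [8, 8, 8, 8])      # uniform small sub-model
out_mixed = forward(x, [8, 16, 32, 16])   # one Mix'n'Match configuration
out_full  = forward(x, [32, 32, 32, 32])  # full model
```

Because the width list is per-layer, the number of extractable configurations grows combinatorially (here 4 widths over 4 layers already gives 256 variants), which is where the "hundreds of sub-models" come from.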
This enables elastic inference: the model size can be selected on the fly to meet latency and cost constraints. Because every sub-model shares the same weights, small and large models produce consistent predictions, which makes the small model an effective drafter and speeds up speculative decoding. No separate compression step or teacher model is needed: a single training run provides "multiple optimal models from a single training session", supporting flexible deployment from mobile devices to large clusters.
2023