Mixture of Depths (MoD)
Gives each token the option to skip a layer entirely.
This method dynamically allocates computation across a transformer's layers, reducing resource use while preserving accuracy: complex tokens receive full processing, while simpler tokens skip it, cutting computational overhead significantly.
MoD scores each token in the sequence and applies a block's computation only to the tokens that need deeper processing; the remaining tokens bypass the block through the residual connection. This moves away from the traditional approach of spending the same amount of computation on every token.
Unlike PAUSE tokens, which effectively add extra attention computation, MoD removes a block's attention and MLP computation for a token whenever its context vector is already sufficient to predict the next token precisely.
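The sketch below illustrates the idea under stated assumptions; it is not the paper's reference implementation. It assumes a generic `block` module that returns the update to be added to the residual stream (attention + MLP output without the residual), and the names `MoDLayer`, `router`, and `capacity` are illustrative. A linear router scores each token, the top-k tokens are run through the block, and everyone else passes through unchanged.

```python
# Minimal Mixture-of-Depths-style routing sketch (illustrative, not official code).
import torch
import torch.nn as nn


class MoDLayer(nn.Module):
    def __init__(self, block: nn.Module, d_model: int, capacity: float = 0.125):
        super().__init__()
        self.block = block                    # returns the update f(x), not x + f(x)
        self.router = nn.Linear(d_model, 1)   # scalar importance score per token
        self.capacity = capacity              # fraction of tokens given full compute

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, t, d = x.shape
        k = max(1, int(t * self.capacity))

        scores = self.router(x).squeeze(-1)              # (batch, seq_len)
        top_vals, top_idx = scores.topk(k, dim=-1)       # tokens that get compute

        # Gather the selected tokens and run the heavy block only on them.
        idx = top_idx.unsqueeze(-1).expand(-1, -1, d)    # (batch, k, d_model)
        selected = torch.gather(x, 1, idx)
        update = self.block(selected)                    # (batch, k, d_model)

        # Gate the update by the router score (one simple choice) so routing
        # stays differentiable, then scatter it back into the residual stream;
        # unselected tokens pass through unchanged.
        gated = torch.sigmoid(top_vals).unsqueeze(-1) * update
        return x.scatter_add(1, idx, gated)
```

With `capacity=0.125`, only one token in eight pays the full cost of the block's attention and MLP; the rest ride the residual stream at essentially zero extra cost, which is how the method trades uniform per-token compute for selective depth.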