Mixture of Depths (MoD)
Gives each token the option to skip a layer entirely.
This method dynamically allocates computation across a transformer's layers, reducing resource use while preserving accuracy: complex tokens receive full processing, while simpler tokens skip it, cutting computational overhead significantly.
MoD scores each token in the sequence and applies a block's computation only to the tokens that need deeper processing; the remaining tokens bypass the block through the residual connection. This moves away from the traditional approach of spending the same amount of computation on every token.
Unlike PAUSE tokens, which effectively add extra attention computation, MoD removes a block's attention and MLP computation for a token whenever its context vector is already sufficient to predict the next token precisely.
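The sketch below illustrates the idea under stated assumptions; it is not the paper's reference implementation. It assumes a generic `block` module that returns the update to be added to the residual stream (attention + MLP output without the residual), and the names `MoDLayer`, `router`, and `capacity` are illustrative. A linear router scores each token, the top-k tokens are run through the block, and everyone else passes through unchanged.

```python
# Minimal Mixture-of-Depths-style routing sketch (illustrative, not official code).
import torch
import torch.nn as nn


class MoDLayer(nn.Module):
    def __init__(self, block: nn.Module, d_model: int, capacity: float = 0.125):
        super().__init__()
        self.block = block                    # returns the update f(x), not x + f(x)
        self.router = nn.Linear(d_model, 1)   # scalar importance score per token
        self.capacity = capacity              # fraction of tokens given full compute

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, t, d = x.shape
        k = max(1, int(t * self.capacity))

        scores = self.router(x).squeeze(-1)              # (batch, seq_len)
        top_vals, top_idx = scores.topk(k, dim=-1)       # tokens that get compute

        # Gather the selected tokens and run the heavy block only on them.
        idx = top_idx.unsqueeze(-1).expand(-1, -1, d)    # (batch, k, d_model)
        selected = torch.gather(x, 1, idx)
        update = self.block(selected)                    # (batch, k, d_model)

        # Gate the update by the router score (one simple choice) so routing
        # stays differentiable, then scatter it back into the residual stream;
        # unselected tokens pass through unchanged.
        gated = torch.sigmoid(top_vals).unsqueeze(-1) * update
        return x.scatter_add(1, idx, gated)
```

With `capacity=0.125`, only one token in eight pays the full cost of the block's attention and MLP; the rest ride the residual stream at essentially zero extra cost, which is how the method trades uniform per-token compute for selective depth.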