Multi-Token Attention
Attention-native Convolutional Layer good at tasks require searching for information within long contexts since the single token attention bottlenecks the amount of information used in the context
How 3d convolution
Multi-head Latent Attention Unlike this, it is not more efficient, so without significant performance improvements when scaling, it has no market advantage