Multi-Token Attention
Attention-native Convolutional Layer good at tasks require searching for information within long contexts since the single token attention bottlenecks the amount of information used in the context

How 3d convolution

MLA Unlike this, it is not more efficient, so without significant performance improvements when scaling, it has no market advantage