Mistral Minitron
Hybrid Mamba-Transformer language model with only about 3.2B out of 31.6B total parameters activated, for high efficiency
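
A rough back-of-the-envelope sketch of what that sparse activation means in practice. Only the 3.2B/31.6B figures come from the note above; the bf16 precision and the 2·N-per-token FLOPs rule of thumb are assumptions for illustration.

```python
# Rough arithmetic for a sparsely activated model: ~3.2B active out of ~31.6B
# total parameters (figures from the note; bf16 and the 2*N FLOPs rule of
# thumb are illustrative assumptions).
total_params = 31.6e9
active_params = 3.2e9
bytes_per_param = 2  # bf16 (assumed)

print(f"activation ratio: {active_params / total_params:.1%}")                   # ~10.1%
print(f"weight memory (bf16): {total_params * bytes_per_param / 1e9:.0f} GB")    # ~63 GB must still be resident
print(f"per-token forward FLOPs ~ 2 * active: {2 * active_params:.2e}")          # compute scales with active params
```

All weights still have to be held in memory, but per-token compute scales with the roughly 10% of parameters that are active.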
Attention mechanisms become computationally and memory-intensive as sequence length increases (especially the KV cache), whereas Mamba-based models (State Space Models) are structurally designed to scale more efficiently on long sequences. Therefore, using Mamba for most layers makes it easier to gain throughput/memory advantages than attention-side optimizations such as Grouped-Query Attention alone.
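
A minimal sketch of why this matters for long context, under assumed illustrative sizes (the KV-head count, head dimension, d_model, and state dimension below are placeholders, not this model's actual configuration): an attention layer's KV cache grows linearly with sequence length, while a Mamba/SSM layer keeps a fixed-size recurrent state regardless of length.

```python
def kv_cache_bytes(seq_len: int, n_kv_heads: int = 8, head_dim: int = 128,
                   dtype_bytes: int = 2) -> int:
    # Keys + values: two tensors of shape [seq_len, n_kv_heads, head_dim]
    return 2 * seq_len * n_kv_heads * head_dim * dtype_bytes

def ssm_state_bytes(d_model: int = 4096, state_dim: int = 16, expand: int = 2,
                    dtype_bytes: int = 2) -> int:
    # A Mamba-style layer keeps a constant-size state of shape [expand * d_model, state_dim]
    return expand * d_model * state_dim * dtype_bytes

for seq_len in (4_096, 131_072):
    mib = kv_cache_bytes(seq_len) / 2**20
    print(f"attention layer @ {seq_len:>7} tokens: {mib:7.1f} MiB KV cache")
print(f"mamba layer (any length):         {ssm_state_bytes() / 2**20:7.2f} MiB state")
```

With most layers being Mamba layers, only the few remaining attention layers pay this length-proportional cache cost.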
