Nemotron

Creator
Seonglae Cho
Created
2023 Nov 21 11:06
Edited
2025 Dec 18 18:30
Refs

Multimodal

Mistral Minitron

Hybrid Mamba–Transformer language model with only about 3.2B of its 31.6B total parameters activated per token for high efficiency
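
A toy accounting sketch of the sparse activation (the dense/expert split below is hypothetical, chosen only to land near the quoted 3.2B active / 31.6B total figures): with MoE-style routing, each token runs only a few experts, so the active parameter count stays roughly an order of magnitude below the total.

```python
# Toy MoE parameter accounting; the split below is hypothetical, not the published config.

def param_counts(dense_params, n_experts, expert_params, top_k):
    """Return (total, active-per-token) parameter counts for a simple MoE layout."""
    total = dense_params + n_experts * expert_params
    active = dense_params + top_k * expert_params  # only the top-k routed experts run per token
    return total, active

total, active = param_counts(dense_params=1.2e9, n_experts=76, expert_params=0.4e9, top_k=5)
print(f"total: {total / 1e9:.1f}B, active per token: {active / 1e9:.1f}B")  # 31.6B / 3.2B
```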
Attention becomes computationally and memory-intensive as sequence length grows (especially the KV cache), whereas Mamba-based models (State Space Models) are structurally designed to scale more efficiently to long sequences. Using Mamba for most layers therefore yields throughput and memory advantages beyond what KV-cache reduction techniques such as Grouped-query Attention alone provide.
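
A rough back-of-the-envelope sketch of this effect: the KV cache of attention layers grows linearly with sequence length, while Mamba/SSM layers keep a fixed-size recurrent state, so a hybrid stack with only a few attention layers needs far less inference memory on long contexts. All layer counts and dimensions below are hypothetical example values, not the published Nemotron configuration.

```python
# Illustrative memory estimate for a hybrid Mamba + attention stack.
# All sizes are hypothetical example values, not the real Nemotron config.

def kv_cache_bytes(seq_len, n_attn_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """KV cache grows linearly with sequence length: K and V per attention layer."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

def mamba_state_bytes(n_mamba_layers, d_model, d_state, dtype_bytes=2):
    """Mamba/SSM layers keep a fixed-size recurrent state, independent of sequence length."""
    return n_mamba_layers * d_model * d_state * dtype_bytes

seq_len = 128_000            # long-context generation
total_layers = 48            # hypothetical depth
attn_layers = 4              # hybrid: only a few attention (GQA-style) layers remain
mamba_layers = total_layers - attn_layers

pure_attention = kv_cache_bytes(seq_len, total_layers, n_kv_heads=8, head_dim=128)
hybrid = (kv_cache_bytes(seq_len, attn_layers, n_kv_heads=8, head_dim=128)
          + mamba_state_bytes(mamba_layers, d_model=4096, d_state=128))

print(f"pure attention KV cache: {pure_attention / 1e9:.1f} GB")   # ~25 GB
print(f"hybrid Mamba+GQA cache : {hybrid / 1e9:.1f} GB")           # ~2 GB
```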
 
 

Backlinks

MoE Model
