While some have experimented with scaling Mamba, none have scaled it beyond 3B parameters
A sub-quadratic solution, similar in spirit to linear attention, used in place of the attention block
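As a rough illustration of the sub-quadratic idea, here is a minimal sketch of (non-causal) kernelized linear attention; the feature map, tensor shapes, and the 1024-token example are assumptions for demonstration only, not Mamba's actual mechanism.

```python
import torch

def linear_attention(q, k, v, eps=1e-6):
    # Assumed kernel feature map phi(x) = elu(x) + 1, which keeps scores positive.
    q = torch.nn.functional.elu(q) + 1
    k = torch.nn.functional.elu(k) + 1
    # Aggregate keys with values first: cost is O(L * d * e) rather than the O(L^2 * d) of softmax attention.
    kv = torch.einsum("bld,ble->bde", k, v)                        # (batch, d, e)
    z = 1.0 / (torch.einsum("bld,bd->bl", q, k.sum(dim=1)) + eps)  # per-query normalizer
    return torch.einsum("bld,bde,bl->ble", q, kv, z)

q = k = v = torch.randn(2, 1024, 64)  # (batch, sequence length, head dim)
out = linear_attention(q, k, v)       # runtime grows linearly with the 1024-token sequence
print(out.shape)                      # torch.Size([2, 1024, 64])
```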
Because its scalability is limited, it is attracting attention as an on-device small LLM
Structurally it resembles an RNN, combining to some degree the strengths and weaknesses of Transformers and RNNs
Attention mechanisms become computationally and memory-intensive (especially the KV cache) as sequence length increases, whereas Mamba-based models (State Space Models) are structurally designed to scale more efficiently on long sequences. Therefore, using Mamba for most layers makes it easier to gain the kind of throughput/memory advantages that techniques such as Grouped-Query Attention aim for.
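A back-of-the-envelope sketch of why this matters for memory: the KV cache grows linearly with sequence length, while an SSM layer carries a fixed-size recurrent state. The configuration below (32 layers, fp16, no GQA) is hypothetical and purely illustrative, not a measurement of any specific model.

```python
def kv_cache_bytes(layers, heads, head_dim, seq_len, bytes_per=2):
    # Attention keeps K and V for every past token in every layer.
    return 2 * layers * heads * head_dim * seq_len * bytes_per

def ssm_state_bytes(layers, d_model, state_dim, bytes_per=2):
    # A Mamba/SSM layer carries a fixed-size state, independent of sequence length.
    return layers * d_model * state_dim * bytes_per

# Hypothetical 7B-class config in fp16, for illustration only.
for seq_len in (4_096, 32_768, 131_072):
    kv = kv_cache_bytes(layers=32, heads=32, head_dim=128, seq_len=seq_len)
    ssm = ssm_state_bytes(layers=32, d_model=4096, state_dim=16)
    print(f"{seq_len:>7} tokens  KV cache {kv / 2**30:6.2f} GiB  SSM state {ssm / 2**20:6.2f} MiB")
```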
Mamba Models
Samba
Vision Mamba (Vim) by hustvl

Seonglae Cho
