NVIDIA Server GPU Architecture
The programming model shifts from a high-occupancy paradigm to a single-CTA-occupancy paradigm, scaling performance by decoupling instruction issue from execution and by separating memory-movement resources from compute resources.
- Volta (1st Gen): Introduced Tensor Cores and warp-level HMMA instructions, with 8-thread quad pairs executing 8×8×4 MMAs → supports FP16 inputs with FP32 accumulation.
- Turing (2nd Gen): Added INT8/INT4 support; warp-synchronized MMA simplified thread configuration and reduced register pressure; began DLSS utilization.
- Ampere (3rd Gen): Expanded MMA to full 32-thread warps, added BF16 support, and reduced register pressure with `cp.async` asynchronous global→shared memory copies and `ldmatrix` vectorized shared-memory loads.
- Hopper (4th Gen): Introduced asynchronous MMA across 128-thread warp groups (`wgmma`), accelerated large global→shared memory transfers with TMA (Tensor Memory Accelerator), and enhanced cross-SM collaboration with CGA (Thread Block Clusters).
- Blackwell (5th Gen): Added TMEM (Tensor Memory), implemented 2-SM MMA operations across CTA pairs, introduced single-thread MMA dispatch, expanded ultra-low-precision data types (MXFP and NVFP4 microscaling formats), and introduced pair-based 4:8 structured sparsity.
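The Volta-era FP16-in/FP32-accumulate contract described above is exposed in CUDA C++ through the `nvcuda::wmma` API. A minimal sketch (the kernel name and the single-tile 16×16×16 shapes are illustrative choices, not from the source):

```cuda
#include <mma.h>
#include <cuda_fp16.h>

using namespace nvcuda;

// One warp computes a single 16x16 output tile: C = A * B.
// A and B hold FP16 inputs; the accumulator fragment is FP32,
// matching the Tensor Core FP16->FP32 accumulation path.
__global__ void wmma_tile_16x16x16(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);            // zero the accumulator
    wmma::load_matrix_sync(a_frag, a, 16);        // leading dimension = 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // D = A*B + C on Tensor Cores
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```

The compiler lowers `mma_sync` to the generation-appropriate HMMA instructions, so the same source runs on Volta's quad-pair scheme and Ampere's full-warp MMA.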
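Ampere's `cp.async` path is reachable without inline PTX via the cooperative-groups `memcpy_async` API. A minimal sketch, assuming a hypothetical tile size and kernel name:

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

namespace cg = cooperative_groups;

constexpr int TILE = 256;  // illustrative tile size, not from the source

__global__ void stage_tile(const float *gmem, float *out) {
    __shared__ float smem[TILE];
    auto block = cg::this_thread_block();

    // On Ampere and later this lowers to cp.async: the copy bypasses
    // registers on its way from global to shared memory, which is the
    // register-pressure reduction the bullet above refers to.
    cg::memcpy_async(block, smem,
                     gmem + blockIdx.x * TILE,
                     sizeof(float) * TILE);
    cg::wait(block);  // block until the asynchronous copy has landed

    // ... compute on smem while the next tile could already be in flight ...
    if (block.thread_rank() == 0) out[blockIdx.x] = smem[0];
}
```

Because the copy is asynchronous, real kernels typically double-buffer: issue `memcpy_async` for tile N+1, compute on tile N, then `wait` before swapping buffers.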