NVIDIA Server GPU

Creator
Seonglae Cho
Created
2024 Oct 16 12:59
Edited
2025 Jul 3 10:5
NVIDIA Server GPU Architecture
The programming model shifts from a high-occupancy paradigm to a single-CTA occupancy paradigm, scaling performance through instruction decoupling and the separation of memory and compute resources.
  • Volta (1st Gen): Introduced warp-level Tensor Cores driven by HMMA instructions, processing 8×8×4 MMA in quad pairs of 8 threads → supports FP16 inputs with FP32 accumulation.
  • Turing (2nd Gen): Added INT8·INT4 support, improved thread configuration and reduced register pressure with warp-synchronized MMA; DLSS began exploiting Tensor Cores.
  • Ampere (3rd Gen): Expanded MMA to the full 32-thread warp, added BF16 support, and reduced register pressure with cp.async asynchronous global→shared memory copies and ldmatrix vectorized loads.
  • Hopper (4th Gen): Introduced 128-thread warp-group asynchronous MMA (wgmma), accelerated large global→shared memory transfers with the TMA (Tensor Memory Accelerator), and enhanced SM collaboration with CGA (Thread Block Clusters).
  • Blackwell (5th Gen): Added TMEM (Tensor Memory), implemented CTA-pair-based 2-SM MMA operations, introduced single-thread MMA dispatch, expanded ultra-low-precision data types (MXFP·NVFP4), and introduced pair-wise 4:8 structured sparsity.
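The warp-level Tensor Core model introduced with Volta is exposed in CUDA C++ through the WMMA API, which later generations extend rather than replace. A minimal sketch (assuming a single 16×16×16 tile with FP16 inputs and FP32 accumulation; `wmma_gemm_tile` is an illustrative name, not an NVIDIA-defined kernel):

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp computes a 16x16x16 tile: D = A*B + C on Tensor Cores.
// All 32 threads of the warp cooperate in every *_sync call.
__global__ void wmma_gemm_tile(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;

    wmma::fill_fragment(fc, 0.0f);         // zero the FP32 accumulator
    wmma::load_matrix_sync(fa, a, 16);     // leading dimension = 16
    wmma::load_matrix_sync(fb, b, 16);
    wmma::mma_sync(fc, fa, fb, fc);        // HMMA: FP16 inputs, FP32 accumulate
    wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
}
```

On Ampere this warp-wide pattern would typically be combined with `cp.async` staging into shared memory, while Hopper and Blackwell move the same computation to asynchronous warp-group (`wgmma`) and 2-SM instructions that this per-warp API does not expose.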
Recommendations