NVIDIA Server GPU Architecture
The programming model has shifted from a high-occupancy paradigm to a single-CTA-occupancy paradigm, scaling performance through asynchronous instruction decoupling and the separation of memory and compute resources.
- Volta (1st Gen): Introduced HMMA instructions and warp-level Tensor Cores that execute an 8×8×4 MMA per 8-thread quad pair → supports FP16 inputs with FP32 accumulation (see the WMMA sketch after this list).
- Turing (2nd Gen): Added INT8·INT4 support, improved thread configuration, reduced register pressure with warp-synchronized MMA, and powered the first DLSS.
- Ampere (3rd Gen): Expanded MMA to full 32-thread warps, added BF16 support, and reduced register pressure with `cp.async` asynchronous global→shared memory copies and `ldmatrix` vectorized loads (see the cp.async sketch below).
- Hopper (4th Gen): Introduced 128-thread warp-group asynchronous MMA (wgmma), accelerated large global→shared memory transfers with TMA (Tensor Memory Accelerator), and enhanced SM collaboration with Thread Block Clusters (CGA; see the cluster sketch below).
- Blackwell (5th Gen): Added TMEM (Tensor Memory), implemented 2-SM MMA operations across CTA pairs, introduced single-thread MMA dispatch, expanded ultra-low-precision data types (MXFP·NVFP4), and introduced pair-based 4:8 structured sparsity.
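
A minimal sketch of the portable Tensor Core path: CUDA's `nvcuda::wmma` API, introduced with Volta, compiles down to HMMA-class MMA instructions. The kernel name and the 16×16×16 tile shape are illustrative choices, not from the source; the FP16-in/FP32-accumulate contract matches the Volta bullet above.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes D = A*B + C on a 16x16x16 tile:
// FP16 inputs, FP32 accumulation (sm_70 or newer).
__global__ void wmma_tile(const half *a, const half *b, float *d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);    // C = 0
    wmma::load_matrix_sync(a_frag, a, 16);  // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);  // Tensor Core MMA
    wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
}
```

Launched with a single warp (`wmma_tile<<<1, 32>>>(dA, dB, dD);`) and compiled with `-arch=sm_70` or newer, all 32 threads cooperatively own the fragments, which is the warp-wide model Ampere later made uniform.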
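A hedged sketch of Ampere's asynchronous copy: `cooperative_groups::memcpy_async` lowers to `cp.async` on sm_80+, moving data global→shared without staging through registers. The kernel and its placeholder doubling "compute" are assumptions for illustration.

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

// Stage a tile global->shared with cp.async (sm_80+): the copy bypasses
// registers and can overlap with independent work before cg::wait().
__global__ void async_tile(const float *in, float *out) {
    extern __shared__ float smem[];
    cg::thread_block block = cg::this_thread_block();
    size_t offset = (size_t)blockIdx.x * blockDim.x;

    // Collective, non-blocking copy of blockDim.x floats into shared memory.
    cg::memcpy_async(block, smem, in + offset, sizeof(float) * blockDim.x);
    cg::wait(block);  // block until the async copy has landed

    out[offset + threadIdx.x] = smem[threadIdx.x] * 2.0f;  // placeholder compute
}
// Launch: async_tile<<<grid, threads, threads * sizeof(float)>>>(in, out);
```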
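A sketch of Hopper Thread Block Clusters, assuming CUDA 12 and sm_90: CTAs in a cluster synchronize together and read each other's shared memory through distributed shared memory. The rank-exchange logic here is a made-up example of the mechanism, not a pattern from the source.

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Two CTAs form one cluster; each reads the other's shared memory
// through distributed shared memory (sm_90, CUDA 12+).
__global__ void __cluster_dims__(2, 1, 1) cluster_exchange(unsigned *out) {
    __shared__ unsigned token;
    cg::cluster_group cluster = cg::this_cluster();

    if (threadIdx.x == 0) token = cluster.block_rank();  // publish own rank
    cluster.sync();  // make shared-memory writes visible cluster-wide

    // Map the peer CTA's copy of `token` into this CTA's address space.
    unsigned peer = cluster.block_rank() ^ 1;
    unsigned *peer_token = cluster.map_shared_rank(&token, peer);
    out[blockIdx.x] = *peer_token;

    cluster.sync();  // keep peer shared memory alive until all reads finish
}
```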
[NVIDIA Tensor Core Evolution: From Volta To Blackwell](https://semianalysis.com/2025/06/23/nvidia-tensor-core-evolution-from-volta-to-blackwell/)


Seonglae Cho