NVIDIA Server GPU Architecture
The programming model shifts from a high-occupancy paradigm to a single-CTA-occupancy paradigm, scaling performance by decoupling instruction issue from execution and by separating memory-movement resources from compute resources.
- Volta (1st Gen): Introduced Tensor Cores and warp-level HMMA instructions, with 8-thread quad pairs executing 8×8×4 MMAs → supports FP16 inputs with FP32 accumulation.
- Turing (2nd Gen): Added INT8/INT4 support; warp-synchronized MMA simplified thread configuration and reduced register pressure; began DLSS utilization.
- Ampere (3rd Gen): Expanded MMA to full 32-thread warps, added BF16 support, and reduced register pressure with `cp.async` asynchronous global→shared memory copies and `ldmatrix` vectorized shared-memory loads.
- Hopper (4th Gen): Introduced asynchronous MMA across 128-thread warp groups (`wgmma`), accelerated large global→shared memory transfers with TMA (Tensor Memory Accelerator), and enhanced cross-SM collaboration with CGA (Thread Block Clusters).
- Blackwell (5th Gen): Added TMEM (Tensor Memory), implemented 2-SM MMA operations across CTA pairs, introduced single-thread MMA dispatch, expanded ultra-low-precision data types (MXFP and NVFP4 microscaling formats), and introduced pair-based 4:8 structured sparsity.
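The Volta-era FP16-in/FP32-accumulate contract described above is exposed in CUDA C++ through the `nvcuda::wmma` API. A minimal sketch (the kernel name and the single-tile 16×16×16 shapes are illustrative choices, not from the source):

```cuda
#include <mma.h>
#include <cuda_fp16.h>

using namespace nvcuda;

// One warp computes a single 16x16 output tile: C = A * B.
// A and B hold FP16 inputs; the accumulator fragment is FP32,
// matching the Tensor Core FP16->FP32 accumulation path.
__global__ void wmma_tile_16x16x16(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);            // zero the accumulator
    wmma::load_matrix_sync(a_frag, a, 16);        // leading dimension = 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // D = A*B + C on Tensor Cores
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```

The compiler lowers `mma_sync` to the generation-appropriate HMMA instructions, so the same source runs on Volta's quad-pair scheme and Ampere's full-warp MMA.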
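Ampere's `cp.async` path is reachable without inline PTX via the cooperative-groups `memcpy_async` API. A minimal sketch, assuming a hypothetical tile size and kernel name:

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

namespace cg = cooperative_groups;

constexpr int TILE = 256;  // illustrative tile size, not from the source

__global__ void stage_tile(const float *gmem, float *out) {
    __shared__ float smem[TILE];
    auto block = cg::this_thread_block();

    // On Ampere and later this lowers to cp.async: the copy bypasses
    // registers on its way from global to shared memory, which is the
    // register-pressure reduction the bullet above refers to.
    cg::memcpy_async(block, smem,
                     gmem + blockIdx.x * TILE,
                     sizeof(float) * TILE);
    cg::wait(block);  // block until the asynchronous copy has landed

    // ... compute on smem while the next tile could already be in flight ...
    if (block.thread_rank() == 0) out[blockIdx.x] = smem[0];
}
```

Because the copy is asynchronous, real kernels typically double-buffer: issue `memcpy_async` for tile N+1, compute on tile N, then `wait` before swapping buffers.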