- Ampere (3rd Gen): Expanded MMA to full 32-thread warps, added BF16 support, reduced register burden with
cp.async asynchronous global→shared memory copies and ldmatrix vectorized loads.
NVIDIA Ampere Architecture Products
Seonglae Cho