- Ampere (3rd Gen): Expanded MMA to full 32-thread warps, added BF16 support, reduced register burden with `cp.async` asynchronous global→shared memory copies and `ldmatrix` vectorized loads.