Graphics processing unit
The memory hierarchy runs from slow global memory (VRAM) through fast shared memory (SRAM) to ultra-fast registers, and compute throughput far exceeds memory bandwidth. A kernel is therefore limited either by memory bandwidth (memory-bound) or by arithmetic throughput (compute-bound); launching many small kernels adds host overhead on top of that. Arithmetic intensity (AI) is the ratio of FLOPs performed to bytes moved. A kernel moves from memory-bound to compute-bound once its AI exceeds the hardware's ratio of peak FLOP/s to memory bandwidth, roughly 13 FLOP/byte for the hardware considered here.
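The memory-bound versus compute-bound distinction above can be sketched as a small roofline calculation. This is a minimal illustration, not a measurement: the function names and the device figures (13 TFLOP/s peak, 1 TB/s bandwidth) are assumptions, chosen only so that the ridge point matches the threshold of about 13 FLOP/byte quoted above.

```python
# Roofline sketch. PEAK_FLOPS and BANDWIDTH are hypothetical device figures,
# picked so that the ridge point is 13 FLOP/byte.
PEAK_FLOPS = 13e12   # FLOP/s (assumed)
BANDWIDTH = 1e12     # bytes/s (assumed)


def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte moved to or from global memory."""
    return flops / bytes_moved


def ridge_point(peak_flops=PEAK_FLOPS, bandwidth=BANDWIDTH):
    """AI at which the bandwidth limit meets the peak-compute limit."""
    return peak_flops / bandwidth


def attainable_flops(ai, peak_flops=PEAK_FLOPS, bandwidth=BANDWIDTH):
    """Upper bound on throughput for a kernel with the given AI."""
    return min(peak_flops, ai * bandwidth)


def classify(ai, peak_flops=PEAK_FLOPS, bandwidth=BANDWIDTH):
    """Label a kernel by which side of the ridge point its AI falls on."""
    if ai >= ridge_point(peak_flops, bandwidth):
        return "compute-bound"
    return "memory-bound"


def vector_add_ai(word_bytes=4):
    # Per element: 1 add, 2 loads and 1 store.
    return arithmetic_intensity(1, 3 * word_bytes)


def matmul_ai(n, word_bytes=4):
    # n x n matmul: 2*n^3 FLOPs; ideal traffic reads A and B and writes C once.
    return arithmetic_intensity(2 * n**3, 3 * n * n * word_bytes)


if __name__ == "__main__":
    kernels = [("vector add", vector_add_ai()), ("matmul n=1024", matmul_ai(1024))]
    for name, ai in kernels:
        print(f"{name}: AI = {ai:.2f} FLOP/byte, {classify(ai)}")
```

Under these assumed figures, an element-wise vector add (AI of about 0.08) is memory-bound, while a 1024 x 1024 matrix multiply with ideal data reuse (AI of about 171) is compute-bound.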
Two optimization strategies help move past these limits. Operation fusion reduces memory traffic by processing intermediate results in a single pass instead of writing them back to memory. Tiling maximizes data reuse by loading large tiles into shared memory, so that each value fetched from global memory is used many times.
- Coalesced loading: improves global memory efficiency by having each warp read a contiguous 128-byte block in a single transaction.
- Bank conflict avoidance: transposes B tiles on the fly while storing them in shared memory, so that the threads of a warp access different banks.
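The two access patterns above can be illustrated with a toy model that counts the 128-byte global memory segments a warp touches and the worst-case number of shared memory words that land in the same bank. It is a sketch under stated assumptions, not a description of any specific GPU: a warp of 32 threads, 4-byte words, 32 banks with consecutive words in consecutive banks, and a 32 x 32 tile are all assumed, and the function names are hypothetical. It also models only the read side of the transposed tile.

```python
WARP_SIZE = 32       # threads per warp (assumed)
WORD_BYTES = 4       # float32
SEGMENT_BYTES = 128  # size of one coalesced global-memory transaction
NUM_BANKS = 32       # shared-memory banks, each one 4-byte word wide (assumed)
TILE = 32            # tile is TILE x TILE words (assumed)


def global_transactions(byte_addresses):
    """Number of distinct 128-byte segments a warp touches in one load."""
    return len({addr // SEGMENT_BYTES for addr in byte_addresses})


def bank_conflict_degree(word_indices):
    """Largest number of distinct words that fall in the same bank.

    1 means conflict-free; k means the access is serialized into k steps.
    Threads reading the same word are not counted as a conflict (broadcast).
    """
    words_per_bank = {}
    for w in set(word_indices):
        words_per_bank.setdefault(w % NUM_BANKS, set()).add(w)
    return max(len(ws) for ws in words_per_bank.values())


def row_major(row, col):
    """Word index of logical element (row, col) in a row-major tile."""
    return row * TILE + col


def transposed(row, col):
    """Word index of the same logical element when the tile is stored transposed."""
    return col * TILE + row


threads = range(WARP_SIZE)

# Global memory: thread t reads word t (contiguous) or word t * TILE (strided).
contiguous = [t * WORD_BYTES for t in threads]
strided = [t * TILE * WORD_BYTES for t in threads]

# Shared memory: thread t reads logical element (t, 0), i.e. one tile column.
column_row_major = [row_major(t, 0) for t in threads]
column_transposed = [transposed(t, 0) for t in threads]

if __name__ == "__main__":
    print("global transactions, contiguous:", global_transactions(contiguous))
    print("global transactions, strided:", global_transactions(strided))
    print("bank conflict degree, row-major tile:", bank_conflict_degree(column_row_major))
    print("bank conflict degree, transposed tile:", bank_conflict_degree(column_transposed))
```

In this model the contiguous load needs 1 transaction against 32 for the strided one, and reading a tile column is a 32-way conflict in the row-major layout but conflict-free once the tile is stored transposed.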
GPU Notion
GPU Usages

To understand GPU implementation with ISA: tiny-gpu (adam-maj, updated 2025 Jul 5)