GPU

Creator: Seonglae Cho
Created: 2020 Jan 19 5:28
Edited: 2025 Jul 5 14:5

Graphics processing unit

GPU memory forms a hierarchy: slow global memory (VRAM), fast shared memory (SRAM), and very fast registers. Compute throughput far exceeds memory bandwidth, so a kernel is limited by one of three overheads: memory bandwidth (memory-bound), arithmetic throughput (compute-bound), or host overhead from launching many small kernels. Arithmetic intensity (AI) is the ratio of FLOPs performed to bytes moved. It must reach roughly 13 FLOP/byte or higher for a kernel to move from memory-bound to compute-bound; the exact threshold is hardware-specific, set by the ratio of peak FLOP/s to memory bandwidth.
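As a rough illustration of the arithmetic intensity threshold, the sketch below estimates AI for two common operations and classifies them against the value of about 13 FLOP/byte quoted above. The function names, the idealized traffic model (each operand read once, each result written once), and the use of 4-byte floats are assumptions made for this example, not measurements of any particular GPU.

```python
# Minimal sketch: estimate arithmetic intensity (FLOP/byte) and classify a kernel.
# RIDGE_POINT comes from the "approximately 13" figure in the text; real hardware
# has its own value (peak FLOP/s divided by memory bandwidth).
RIDGE_POINT = 13.0


def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte moved to or from global memory."""
    return flops / bytes_moved


def classify(ai: float, ridge: float = RIDGE_POINT) -> str:
    """Label a kernel by comparing its intensity with the ridge point."""
    return "compute-bound" if ai >= ridge else "memory-bound"


def elementwise_add_ai(n: int, dtype_bytes: int = 4) -> float:
    """c = a + b on n elements: n FLOPs, two reads and one write per element."""
    return arithmetic_intensity(n, 3 * n * dtype_bytes)


def matmul_ai(n: int, dtype_bytes: int = 4) -> float:
    """n x n matmul: 2*n^3 FLOPs; ideal traffic reads A and B and writes C once."""
    return arithmetic_intensity(2 * n**3, 3 * n**2 * dtype_bytes)


if __name__ == "__main__":
    for name, ai in [
        ("elementwise add", elementwise_add_ai(1 << 20)),
        ("matmul n=64", matmul_ai(64)),
        ("matmul n=1024", matmul_ai(1024)),
    ]:
        print(f"{name}: {ai:.2f} FLOP/byte -> {classify(ai)}")
```

Under this idealized model, elementwise addition stays memory-bound regardless of size, while matrix multiplication crosses the threshold once the matrices are large enough, provided the kernel actually achieves that data reuse.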
Two optimization strategies help avoid these bounds. Operation fusion reduces memory traffic by processing intermediate results in a single kernel instead of writing them back to memory. Tiling maximizes data reuse by loading large tiles into shared memory, so each value fetched from global memory is read many times.
  • Coalesced loading: the threads of a warp read one contiguous 128-byte block of global memory at once, which makes global memory access efficient.
  • Bank conflict avoidance: B tiles are transposed on the fly when stored in shared memory, so the threads of a warp do not contend for the same bank.
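The two access-pattern rules above can be checked with a small host-side model. This is a sketch, not GPU code: it assumes a 32-thread warp, 32 shared memory banks that are 4 bytes wide, and a 32x32 tile of 4-byte floats, which are typical values for NVIDIA GPUs but are assumptions here. All function names are made up for the example.

```python
# Host-side model of warp memory access patterns (not actual GPU code).
WARP_SIZE = 32        # threads per warp (assumed)
SEGMENT_BYTES = 128   # contiguous block size from the text
NUM_BANKS = 32        # shared memory banks (assumed)
TILE = 32             # tile width in 4-byte words (assumed)


def warp_addresses(stride_elems: int, elem_bytes: int = 4) -> list[int]:
    """Byte address read by each thread when thread t reads element t * stride."""
    return [t * stride_elems * elem_bytes for t in range(WARP_SIZE)]


def segments_touched(byte_addresses: list[int]) -> int:
    """Number of distinct 128-byte blocks the warp touches (1 is fully coalesced)."""
    return len({addr // SEGMENT_BYTES for addr in byte_addresses})


def max_bank_conflict(word_indices: list[int]) -> int:
    """Worst-case count of distinct words mapped to one bank (1 is conflict-free)."""
    per_bank: dict[int, int] = {}
    for word in set(word_indices):  # identical words are broadcast, not a conflict
        bank = word % NUM_BANKS
        per_bank[bank] = per_bank.get(bank, 0) + 1
    return max(per_bank.values())


def column_read_words(col: int, transposed: bool) -> list[int]:
    """Shared memory word read by each thread when a warp reads one tile column."""
    if transposed:
        # Tile was stored transposed, so the logical column lies along a row.
        return [col * TILE + t for t in range(WARP_SIZE)]
    return [t * TILE + col for t in range(WARP_SIZE)]


if __name__ == "__main__":
    print("stride 1 blocks:", segments_touched(warp_addresses(1)))
    print("stride 32 blocks:", segments_touched(warp_addresses(32)))
    print("column read, as stored:", max_bank_conflict(column_read_words(5, False)))
    print("column read, transposed:", max_bank_conflict(column_read_words(5, True)))
```

In this model, unit-stride reads fall in a single 128-byte block while a stride of 32 elements spreads the warp over 32 blocks. Likewise, reading a column of an untransposed tile sends every thread to the same bank, whereas the transposed layout gives each thread its own bank.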
GPU Notion
 
 
GPU Usages
 
https://x.com/RajaXg/status/1812721241985610147
 
 

To understand GPU implementation with an ISA:
tiny-gpu (adam-maj)

 
 
