The GPU memory issue
FlashAttention never materializes the full attention matrix. Instead, it processes the matrix in small tiles and keeps running statistics (the running max and the running sum of exponentials) so the softmax can be computed incrementally along the way. Each tile is computed entirely in SRAM, avoiding round trips to HBM.
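The "keep the statistics" idea above can be sketched with the online-softmax trick for a single query row. This is a minimal pure-Python illustration, not the kernel itself: tile size, function name, and the list-based layout are assumptions for clarity.

```python
import math

def online_softmax_weighted_sum(scores, values, tile_size=2):
    """Compute softmax(scores) @ values one tile at a time.

    Instead of materializing all softmax weights at once, keep two
    running statistics: the running max `m` and the running sum of
    exponentials `l`. Whenever `m` grows, rescale the partial output
    so earlier tiles stay consistent with the new max.
    """
    m = float("-inf")             # running max of scores seen so far
    l = 0.0                       # running sum of exp(score - m)
    out = [0.0] * len(values[0])  # running weighted sum of value rows

    for start in range(0, len(scores), tile_size):
        s_tile = scores[start:start + tile_size]
        v_tile = values[start:start + tile_size]

        # New running max; rescale old statistics to the new base.
        m_new = max(m, max(s_tile))
        scale = math.exp(m - m_new) if m != float("-inf") else 0.0
        l *= scale
        out = [o * scale for o in out]

        # Accumulate this tile's contribution in the new base.
        for s, v in zip(s_tile, v_tile):
            w = math.exp(s - m_new)
            l += w
            out = [o + w * vi for o, vi in zip(out, v)]
        m = m_new

    # Final normalization by the total sum of exponentials.
    return [o / l for o in out]
```

The key property is that the result matches the ordinary softmax-weighted sum exactly, even though no tile ever sees the full score vector.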
It accelerates attention by combining tiling with recomputation.
Flash Attention usage
Device support
FlashAttention's backward pass restricts large head dimensions to data-center GPUs; on other devices it raises:
RuntimeError: FlashAttention backward for head dim > 192 requires A100/A800 or H100/H800
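A minimal sketch of a guard one might run before enabling the FlashAttention backward pass. The helper name and the capability check are assumptions that only mirror the error message above (A100/A800 are compute capability 8.0, H100/H800 are 9.0), not the library's actual dispatch logic.

```python
def flash_attn_backward_supported(head_dim, compute_capability):
    """Return True if the backward pass should work for this head dim.

    `compute_capability` is a (major, minor) tuple, e.g. the value of
    torch.cuda.get_device_capability(). Head dims above 192 need an
    A100/A800 (sm80) or H100/H800 (sm90) class GPU.
    """
    if head_dim <= 192:
        return True
    # Hypothetical check mirroring the RuntimeError above.
    return compute_capability in {(8, 0), (9, 0)}
```

With a check like this, a training script can fall back to a standard attention implementation instead of crashing mid-backward on consumer GPUs.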