Flash Attention

Creator: Alan Jo
Created: 2023 Jun 29 14:42
Editor: Alan Jo
Edited: 2024 May 16 16:11
Refs: CUDA PTX

GPU memory issue

FlashAttention does not materialize the full attention matrix. Instead, it builds small tiles of the matrix and keeps running softmax statistics (the running max and normalizer) so the softmax can be computed along the way. Each tile is computed in on-chip SRAM rather than being written out to HBM.
Attention is accelerated by using tiling and recomputation.
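
To make the tiling and running-statistics idea concrete, here is a minimal NumPy sketch (illustrative only; the function name, variable names, and `tile` size are my own, not the actual FlashAttention kernel). It processes key/value tiles one at a time and keeps a running max and normalizer per query row, so the full score matrix never exists in memory:

```python
import numpy as np

def tiled_attention(q, k, v, tile=128):
    """Attention computed tile-by-tile with running softmax statistics.

    q: (n, d), k: (m, d), v: (m, d_v). Only (n, tile) score blocks exist
    at any time, mimicking what FlashAttention keeps in SRAM.
    """
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, v.shape[1]))
    row_max = np.full(n, -np.inf)   # running max of scores per query row
    row_sum = np.zeros(n)           # running softmax normalizer per row

    for start in range(0, k.shape[0], tile):
        kb = k[start:start + tile]            # one key tile
        vb = v[start:start + tile]            # one value tile
        s = q @ kb.T * scale                  # (n, tile) partial scores

        new_max = np.maximum(row_max, s.max(axis=1))
        # Rescale previously accumulated output/normalizer to the new max
        correction = np.exp(row_max - new_max)
        p = np.exp(s - new_max[:, None])

        out = out * correction[:, None] + p @ vb
        row_sum = row_sum * correction + p.sum(axis=1)
        row_max = new_max

    return out / row_sum[:, None]

# Sanity check against naive attention that materializes the full matrix
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((64, 32)) for _ in range(3))
scores = q @ k.T / np.sqrt(32)
ref = np.exp(scores - scores.max(axis=1, keepdims=True))
ref = ref / ref.sum(axis=1, keepdims=True) @ v
assert np.allclose(tiled_attention(q, k, v, tile=16), ref)
```

The real kernel additionally recomputes the tiles during the backward pass instead of storing them, which is the "recomputation" part mentioned above.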
Flash Attention usages
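
One common way to use it is through PyTorch. A minimal sketch, assuming PyTorch 2.x on a CUDA GPU (the backend-selection context manager has moved between `torch.backends.cuda.sdp_kernel` and `torch.nn.attention.sdpa_kernel` across versions, so check your version): `scaled_dot_product_attention` dispatches to a FlashAttention kernel when the flash backend is enabled and the inputs qualify (fp16/bf16, supported head dim).

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: (batch, heads, seq_len, head_dim)
q = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Force the FlashAttention backend (falls back with an error if unsupported)
with torch.backends.cuda.sdp_kernel(enable_flash=True,
                                    enable_math=False,
                                    enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```

The standalone `flash-attn` package exposes a similar entry point (`flash_attn_func`), with inputs laid out as (batch, seq_len, heads, head_dim).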
 
 
 

Device support

RuntimeError: FlashAttention backward for head dim > 192 requires A100/A800 or H100/H800
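
One way to avoid hitting this at runtime is to check the head dimension against the device before enabling the flash backend. A minimal sketch (the helper name and fallback are mine; the thresholds just mirror the message above, where A100/A800 are compute capability 8.0 and H100/H800 are 9.0):

```python
import torch

def flash_backward_supported(head_dim: int) -> bool:
    """Rough guard for the error above: backward with head_dim > 192
    requires A100/A800 (sm 8.0) or H100/H800 (sm 9.0)."""
    if head_dim <= 192:
        return True
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) in {(8, 0), (9, 0)}

# Example: disable the flash backend when its backward pass would fail
head_dim = 256
if not flash_backward_supported(head_dim):
    torch.backends.cuda.enable_flash_sdp(False)
```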

Windows

 
 
 
 

Recommendations