Flash Attention2

Creator
Seonglae Cho
Created
2023 Oct 14 3:22
Editor
Seonglae Cho
Edited
2024 Mar 31 15:04
Refs
  • Reduces the number of non-matmul FLOPs, since each non-matmul FLOP is roughly 16x more expensive than a matmul FLOP (an A100 delivers 312 TFLOPS of FP16 matmul via Tensor Cores but only ~19.5 TFLOPS of non-matmul FP32 throughput)
  • Better Sequence Parallelism: work is parallelized across the sequence-length dimension in addition to batch and heads, improving GPU occupancy for long sequences and small batches
  • Causal mask: key blocks that are fully masked out are skipped entirely, roughly halving the work for causal attention
  • Better work partitioning between thread blocks and warps, cutting shared-memory reads and writes (see the sketch after this list for the tiled online-softmax loop these optimizations target)
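All of the bullets above operate on the same tiled online-softmax loop. Below is a minimal pure-PyTorch sketch of that scheme for a single head; the function name, `block_size`, and the single-head `(seq_len, head_dim)` layout are assumptions for readability. The real FlashAttention-2 kernel implements this in CUDA with one thread block per query tile, so this illustrates the algorithm, not the implementation.

```python
import torch

def fa2_reference(q, k, v, block_size=64, causal=False):
    """Illustrative FlashAttention-2 style tiling; not the CUDA kernel.

    q, k, v: (seq_len, head_dim) float32 tensors for one attention head.
    """
    seq_len, head_dim = q.shape
    out = torch.empty_like(q)
    scale = head_dim ** -0.5

    for qs in range(0, seq_len, block_size):
        q_blk = q[qs:qs + block_size] * scale
        n_q = q_blk.shape[0]
        m = torch.full((n_q,), float("-inf"))   # running row max
        l = torch.zeros(n_q)                     # running softmax denominator
        acc = torch.zeros(n_q, head_dim)         # un-normalized output

        for ks in range(0, seq_len, block_size):
            # Causal-mask optimization: key blocks strictly to the right of
            # this query block are fully masked, so skip them outright.
            if causal and ks > qs + n_q - 1:
                break
            s = q_blk @ k[ks:ks + block_size].T           # matmul FLOPs
            if causal:
                # Diagonal blocks are only partially masked.
                qi = torch.arange(qs, qs + n_q)[:, None]
                ki = torch.arange(ks, min(ks + block_size, seq_len))[None, :]
                s = s.masked_fill(ki > qi, float("-inf"))
            m_new = torch.maximum(m, s.max(dim=-1).values)
            p = torch.exp(s - m_new[:, None])              # non-matmul FLOPs
            alpha = torch.exp(m - m_new)                    # rescale old state
            l = alpha * l + p.sum(dim=-1)
            acc = alpha[:, None] * acc + p @ v[ks:ks + block_size]
            m = m_new

        # FA2's non-matmul saving: divide by the denominator once per query
        # block instead of rescaling the output on every inner iteration.
        out[qs:qs + block_size] = acc / l[:, None]
    return out

# Matches standard attention up to floating-point error:
q = k = v = torch.randn(256, 64)
ref = torch.softmax((q @ k.T) * 64 ** -0.5, dim=-1) @ v
assert torch.allclose(fa2_reference(q, k, v), ref, atol=1e-5)
```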
flash-attention/flash-attention-v2 installation failed
Efficient Inference on a Single GPU
https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flash-attention-2
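Per the linked doc, Flash Attention 2 can be switched on in transformers at model load time. A minimal sketch, assuming a recent transformers release (`attn_implementation="flash_attention_2"`; older versions used `use_flash_attention_2=True`), the `flash-attn>=2` package installed, and a placeholder model id:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder: any FA2-supported model

# Requires the flash-attn>=2 package and fp16/bf16 weights; the FA2
# kernels do not run in fp32.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("FlashAttention-2 speeds up", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```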
 

Copyright Seonglae Cho