- Fewer non-matmul FLOPs, because each non-matmul FLOP is about 16x more expensive than a matmul FLOP
- Better parallelism along the sequence length dimension
- Causal mask: blocks that are fully masked are skipped
- Better work partitioning between thread blocks and warps
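
A rough sketch of the arithmetic behind two of the points above. The A100 throughput figures and both helper names are assumptions for illustration and are not taken from these notes. The tile count also assumes equal-sized square tiles, so a tile whose column block index exceeds its row block index is entirely masked.

```python
# Assumed A100 peak throughput (illustrative, not stated in these notes).
A100_MATMUL_TFLOPS = 312.0      # FP16/BF16 matmul on Tensor Cores
A100_NON_MATMUL_TFLOPS = 19.5   # FP32 non-matmul work


def non_matmul_cost_ratio(matmul_tflops=A100_MATMUL_TFLOPS,
                          non_matmul_tflops=A100_NON_MATMUL_TFLOPS):
    """Relative cost of one non-matmul FLOP versus one matmul FLOP."""
    return matmul_tflops / non_matmul_tflops


def causal_block_counts(num_row_blocks, num_col_blocks):
    """Count attention tiles that must be computed vs. skipped under a causal mask.

    Hypothetical helper: tile (i, j) is fully masked when j > i, assuming
    equal-sized square tiles aligned on the diagonal.
    """
    computed = skipped = 0
    for i in range(num_row_blocks):
        for j in range(num_col_blocks):
            if j > i:
                skipped += 1   # every key in the tile is in the future
            else:
                computed += 1  # fully visible tile or diagonal tile
    return computed, skipped


if __name__ == "__main__":
    print(non_matmul_cost_ratio())
    print(causal_block_counts(8, 8))
```

Under these assumptions, slightly under half of the tiles can be skipped when the grid is square, and only the diagonal tiles need the mask applied element by element.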

Efficient Inference on a Single GPU
https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flash-attention-2

Seonglae Cho