Flash Attention 2

Creator

Seonglae Cho

Created

2023 Oct 14 3:22

Editor

Seonglae Cho

Edited

2024 Mar 31 15:4

Refs

candle-flash-attn-v1

huggingface • Updated 2023 Oct 11 15:40

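Reduces the number of non-matmul FLOPs, because each non-matmul FLOP is about 16x more expensive than a matmul FLOP on GPUs with dedicated matmul units (on an A100, 312 TFLOPs/s for FP16/BF16 matmul versus 19.5 TFLOPs/s for non-matmul FP32)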

Better sequence parallelism

Causal mask

Better work partitioning between thread blocks and warps
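
One way the non-matmul work can be cut is to keep the output accumulator unnormalized inside the block loop and divide by the softmax denominator only once at the end. The following is a minimal pure-Python sketch of that idea for a single query row, not the actual CUDA kernel. The function names `attention_row_blockwise` and `attention_row_reference` and the `block_size` parameter are hypothetical and chosen only for illustration.

```python
import math


def attention_row_blockwise(q, keys, values, block_size=2):
    """Attention output for one query row, reading keys/values block by block.

    The accumulator `acc` stays unnormalized inside the loop; the division by
    the softmax denominator happens once, after the last block.
    """
    sm_scale = 1.0 / math.sqrt(len(q))   # standard 1/sqrt(d) softmax scaling
    running_max = float("-inf")          # largest score seen so far
    running_sum = 0.0                    # denominator relative to running_max
    acc = [0.0] * len(values[0])         # unnormalized output accumulator

    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sm_scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]

        new_max = max(running_max, max(scores))
        correction = math.exp(running_max - new_max)  # rescales old statistics
        probs = [math.exp(s - new_max) for s in scores]

        running_sum = running_sum * correction + sum(probs)
        acc = [a * correction + sum(p * v[j] for p, v in zip(probs, v_blk))
               for j, a in enumerate(acc)]
        running_max = new_max

    return [a / running_sum for a in acc]  # single normalization at the end


def attention_row_reference(q, keys, values):
    """Plain softmax attention for one query row, used as a correctness check."""
    sm_scale = 1.0 / math.sqrt(len(q))
    scores = [sm_scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]
```

Both functions should agree up to floating-point error for any block size, which is what makes the deferred division safe: the per-block rescaling by `correction` is still needed, but the per-block division is not.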

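The causal mask item above can be illustrated at the tile level: with a causal mask, a tile whose key indices all exceed its query indices contributes nothing and can be skipped, and a tile whose key indices are all at or below its query indices needs no masking at all. The sketch below only classifies tiles under those two rules; the function name `classify_causal_blocks` and the parameters `block_m` and `block_n` are hypothetical and do not come from the library.

```python
from collections import Counter


def classify_causal_blocks(seq_len, block_m, block_n):
    """Label each (query block, key block) tile under a causal mask.

    "skip": every key index is greater than every query index (fully masked)
    "full": every key index is <= every query index (no masking needed)
    "mask": the tile straddles the diagonal, so the mask is applied per element
    """
    labels = {}
    for i, row_start in enumerate(range(0, seq_len, block_m)):
        row_end = min(row_start + block_m, seq_len) - 1      # last query index
        for j, col_start in enumerate(range(0, seq_len, block_n)):
            col_end = min(col_start + block_n, seq_len) - 1  # last key index
            if col_start > row_end:
                labels[(i, j)] = "skip"
            elif col_end <= row_start:
                labels[(i, j)] = "full"
            else:
                labels[(i, j)] = "mask"
    return labels


def count_causal_blocks(seq_len, block_m, block_n):
    """Count how many tiles fall into each category."""
    return Counter(classify_causal_blocks(seq_len, block_m, block_n).values())
```

For example, `count_causal_blocks(1024, 128, 128)` gives 28 skipped tiles, 28 unmasked tiles, and 8 diagonal tiles, so only the diagonal tiles pay for elementwise masking.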

flash-attention/flash-attention-v2 installation failed

Updated 2023 Oct 19 7:11

Efficient Inference on a Single GPU

https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flash-attention-2
