FlashAttention-2 improvements:
- Reduced the number of non-matmul FLOPs, since each non-matmul FLOP is roughly 16x more expensive than a matmul FLOP
- Better sequence parallelism
- Causal mask handling
- Better work partitioning across blocks and warps

Links:
- flash-attention/flash-attention-v2 installation failed
- Efficient Inference on a Single GPU: https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flash-attention-2
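
The linked Hugging Face page covers enabling FlashAttention-2 in `transformers` for single-GPU inference. A minimal sketch, assuming `flash-attn` is installed (`pip install flash-attn --no-build-isolation`) and the model architecture supports it; the model id below is just a placeholder:

```python
# Sketch: load a causal LM with FlashAttention-2 kernels enabled in transformers.
# Assumes a CUDA GPU, flash-attn installed, and a supported architecture.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; any FA2-supported model

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,                # FlashAttention-2 requires fp16 or bf16
    attn_implementation="flash_attention_2",  # request the FlashAttention-2 kernels
    device_map="auto",
)

inputs = tokenizer("FlashAttention-2 speeds up attention by", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```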