Attention Mechanism Optimizations

Sparse Attention
Monarch Mixer
Flash Attention
Dilated Attention
PagedAttention
Grouped Query Attention (see the sketch at the end of this page)
Multi Query Attention
Clustered Attention
Layer Selective Rank Reduction
KV Cache
FAVOR+
Chunk Attention
Memory-efficient Attention
Gated Attention
FlexAttention
Selective Attention
Fire Attention
Multi-head Attention Optimization
  Multi-head Attention
  Grouped-query Attention
  Multi-query Attention
LSH Attention
Sigmoid Attention: replaces the traditional softmax with a sigmoid and a constant bias (a minimal sketch follows the links below)

Theory, Analysis, and Best Practices for Sigmoid Self-Attention
Attention is a key part of the transformer architecture. It is a sequence-to-sequence mapping that transforms each sequence element into a weighted sum of values. The weights are typically...
https://arxiv.org/abs/2409.04431

How to make LLMs go fast
Blog about linguistics, programming, and my projects
https://vgel.me/posts/faster-inference/

A guide to LLM inference and performance
To attain the full power of a GPU during LLM inference, you have to know whether the inference is compute bound or memory bound. Learn how to better utilize GPU resources.
https://www.baseten.co/blog/llm-transformer-inference-guide

Optimization

Hugging Face Reads, Feb. 2021 - Long-range Transformers
https://huggingface.co/blog/long-range-transformers
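
To make the softmax-vs-sigmoid idea concrete, here is a minimal sketch, not taken from the linked paper: standard scaled dot-product attention next to a sigmoid variant that swaps the row-wise softmax for an elementwise sigmoid with a constant bias. The bias value b = -log(n) is an assumption chosen so the sigmoid weights start near 1/n, roughly matching softmax at initialization; the paper's exact recipe may differ.

```python
import math
import torch

def softmax_attention(q, k, v):
    # q, k, v: (batch, seq_len, d); weights are a row-normalized softmax.
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)  # (batch, n, n)
    weights = torch.softmax(scores, dim=-1)          # each row sums to 1
    return weights @ v                               # weighted sum of values

def sigmoid_attention(q, k, v):
    # Same scores, but each weight is sigmoid(score + b); no row normalization.
    n, d = q.size(-2), q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)
    b = -math.log(n)                                 # assumed constant bias
    weights = torch.sigmoid(scores + b)
    return weights @ v

q = k = v = torch.randn(2, 8, 16)
print(softmax_attention(q, k, v).shape, sigmoid_attention(q, k, v).shape)
```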
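
Several items in the list above (Multi-head, Grouped-query, Multi-query Attention) differ only in how many key/value heads are kept, which sets the size of the KV cache and therefore how memory bound decoding is. The sketch below is illustrative only, not taken from any of the linked posts; the shapes and the repeat_interleave sharing trick are assumptions for demonstration.

```python
import math
import torch

def grouped_query_attention(q, k, v):
    # q: (batch, n_q_heads, seq, d); k, v: (batch, n_kv_heads, seq, d)
    # n_q_heads must be a multiple of n_kv_heads.
    n_q_heads, n_kv_heads = q.size(1), k.size(1)
    group = n_q_heads // n_kv_heads
    # Repeat each K/V head so every group of query heads shares it.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return torch.softmax(scores, dim=-1) @ v

q = torch.randn(1, 8, 16, 64)  # 8 query heads
mha = grouped_query_attention(q, torch.randn(1, 8, 16, 64), torch.randn(1, 8, 16, 64))  # 8 KV heads: multi-head
gqa = grouped_query_attention(q, torch.randn(1, 2, 16, 64), torch.randn(1, 2, 16, 64))  # 2 KV heads: grouped-query
mqa = grouped_query_attention(q, torch.randn(1, 1, 16, 64), torch.randn(1, 1, 16, 64))  # 1 KV head: multi-query
print(mha.shape, gqa.shape, mqa.shape)
```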