Window attention, where only the most recent KVs are cached, is a natural approach (a minimal sketch follows the links below).

Sparse attentions:
- Sliding window attention
- BigBird attention
- LSG attention
- Dynamic Sparse Attention
- Star Attention

Links:
- Hugging Face Reads, Feb. 2021 - Long-range Transformers: https://huggingface.co/blog/long-range-transformers
- LLMs May Not Need Dense Self Attention (Sink Tokens and the Sparsity of Attention Scores in Transformer Models): https://medium.com/@buildingblocks/llms-may-not-need-dense-self-attention-1fa3bf47522e
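To make the window-attention idea concrete, here is a minimal single-head sketch: each decode step appends the new key/value pair and evicts anything older than the window, so cache memory stays constant no matter how long the sequence grows. The class name `SlidingWindowKVCache` and its parameters are illustrative, not taken from any of the linked posts.

```python
import numpy as np

class SlidingWindowKVCache:
    """Keep only the most recent `window` key/value pairs (single head, illustrative)."""

    def __init__(self, window: int, head_dim: int):
        self.window = window
        self.keys = np.empty((0, head_dim))    # (cached_len, head_dim)
        self.values = np.empty((0, head_dim))  # (cached_len, head_dim)

    def append(self, k: np.ndarray, v: np.ndarray) -> None:
        # Append this step's KV, then evict everything outside the window.
        self.keys = np.concatenate([self.keys, k])[-self.window:]
        self.values = np.concatenate([self.values, v])[-self.window:]

    def attend(self, q: np.ndarray) -> np.ndarray:
        # Scaled dot-product attention over the windowed cache only.
        scores = q @ self.keys.T / np.sqrt(q.shape[-1])       # (1, cached_len)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)        # softmax
        return weights @ self.values                          # (1, head_dim)

# Decode 10 steps with a window of 4: the cache stays bounded at 4 entries.
cache = SlidingWindowKVCache(window=4, head_dim=8)
rng = np.random.default_rng(0)
for _ in range(10):
    k, v, q = (rng.standard_normal((1, 8)) for _ in range(3))
    cache.append(k, v)
    out = cache.attend(q)
print(cache.keys.shape)  # (4, 8): only the 4 most recent KVs remain
```

Note that pure windowing evicts the earliest tokens; the sink-token observation in the second link concerns exactly those early positions, which tend to attract a disproportionate share of attention scores.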