Star Attention runs in two phases:

1. Context Encoding: blockwise-local attention across distributed hosts
2. Query Encoding and Token Generation: query and response tokens use sequence-global attention to access the cached tokens (sketched below)

Star Attention: Efficient LLM Inference over Long Sequences — https://arxiv.org/abs/2411.17116
"Inference with Transformer-based Large Language Models (LLMs) on long sequences is both costly and slow due to the quadratic complexity of the self-attention mechanism. We introduce Star..."
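
Below is a minimal NumPy sketch of the two phases, not the paper's implementation: hosts are simulated as list entries, attention is single-head, and the per-host log-sum-exp merge in phase 2 is an assumed (but exact) way to combine partial attention over distributed KV caches. Function names and block sizes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def phase1_context_encoding(context_kv_blocks):
    """Phase 1: each host attends only within its local context block
    (blockwise-local attention) and keeps its own KV cache."""
    caches = []
    for K, V in context_kv_blocks:           # one (K, V) block per host
        caches.append((K, V))                 # KV cache stays on that host
        # local attention over the block would produce its hidden states;
        # omitted here since only the cached K/V matter for phase 2
    return caches

def phase2_global_query(q, caches):
    """Phase 2: a query/generation token attends to every cached block
    (sequence-global attention). Each host returns a partial output plus
    its log-sum-exp; the partials are merged into the exact global result."""
    d = q.shape[-1]
    partial_outs, lses = [], []
    for K, V in caches:                       # per-host partial attention
        scores = (K @ q) / np.sqrt(d)         # shape: (block_len,)
        lses.append(np.log(np.exp(scores).sum()))
        partial_outs.append(softmax(scores) @ V)
    # weights exp(lse_b) / sum_b exp(lse_b) recover the global softmax
    w = softmax(np.array(lses))
    return sum(wi * oi for wi, oi in zip(w, partial_outs))

# tiny usage example: two simulated hosts with random K/V blocks
rng = np.random.default_rng(0)
d, block_len = 8, 4
blocks = [(rng.normal(size=(block_len, d)), rng.normal(size=(block_len, d)))
          for _ in range(2)]
caches = phase1_context_encoding(blocks)
out = phase2_global_query(rng.normal(size=(d,)), caches)
print(out.shape)  # (8,)
```

The point of the merge step is that each host only ever sees its own block, yet the combined output equals attention over the full concatenated cache, which is what lets phase 1 stay local and distributed while phase 2 remains sequence-global.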