1. Context Encoding
The long context is split into contiguous blocks and processed in parallel with blockwise-local attention across distributed hosts, each host building the KV cache for its own block.
2. Query Encoding and Token Generation
Query and generated tokens use sequence-global attention over every host's cached KV; each host returns a partial attention result that the query host combines into the global output (a minimal sketch of both phases follows below).
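Below is a minimal NumPy sketch of the two-phase flow, assuming a single attention head, toy projection matrices `Wq`/`Wk`/`Wv`, and omitting the paper's anchor-block detail; the function names are illustrative, not the paper's implementation. It shows phase 1 building per-host KV caches with local attention, and phase 2 combining per-host partial results via their log-sum-exp values so the query's global softmax attention is exact.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def phase1_context_encoding(context, n_hosts, d):
    """Phase 1: split the context into blocks; each 'host' runs
    blockwise-local attention over its own block and keeps its KV cache."""
    blocks = np.array_split(context, n_hosts, axis=0)
    kv_caches = []
    for block in blocks:
        q, k, v = block @ Wq, block @ Wk, block @ Wv   # per-block projections
        _ = softmax(q @ k.T / np.sqrt(d)) @ v          # local attention (no cross-block access)
        kv_caches.append((k, v))                       # cache K/V for phase 2
    return kv_caches

def phase2_query_attention(query_tok, kv_caches, d):
    """Phase 2: the query attends globally to all hosts' cached KV.
    Each host returns a partial output plus its log-sum-exp; the query
    host reweights and sums them into exact global softmax attention."""
    q = query_tok @ Wq
    partials, lses = [], []
    for k, v in kv_caches:
        scores = q @ k.T / np.sqrt(d)            # (1, block_len)
        lses.append(np.log(np.exp(scores).sum()))  # local log-sum-exp
        partials.append(softmax(scores) @ v)       # local attention output
    lses = np.array(lses)
    weights = np.exp(lses - lses.max())
    weights = weights / weights.sum()              # global softmax correction
    return sum(w * p for w, p in zip(weights, partials))

rng = np.random.default_rng(0)
d = 16
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
context = rng.normal(size=(128, d))    # toy "long" context
query_tok = rng.normal(size=(1, d))    # single query token
caches = phase1_context_encoding(context, n_hosts=4, d=d)
out = phase2_query_attention(query_tok, caches, d=d)
print(out.shape)  # (1, 16)
```

Because the per-host weights are just the normalized exponentials of the local log-sum-exp values, the phase-2 result matches full attention over the concatenated context while each host only ever sees its own block.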

Star Attention: Efficient LLM Inference over Long Sequences
Inference with Transformer-based Large Language Models (LLMs) on long sequences is both costly and slow due to the quadratic complexity of the self-attention mechanism. We introduce Star...
https://arxiv.org/abs/2411.17116

