1. Context Encoding
The long context is split into contiguous blocks and processed in parallel with blockwise-local attention across distributed hosts, each host building the KV cache for its own block.
2. Query Encoding and Token Generation
Query and generated tokens use sequence-global attention over every host's cached KV; each host returns a partial attention result that the query host combines into the global output (a minimal sketch of both phases follows below).
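Below is a minimal NumPy sketch of the two-phase flow, assuming a single attention head, toy projection matrices `Wq`/`Wk`/`Wv`, and omitting the paper's anchor-block detail; the function names are illustrative, not the paper's implementation. It shows phase 1 building per-host KV caches with local attention, and phase 2 combining per-host partial results via their log-sum-exp values so the query's global softmax attention is exact.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def phase1_context_encoding(context, n_hosts, d):
    """Phase 1: split the context into blocks; each 'host' runs
    blockwise-local attention over its own block and keeps its KV cache."""
    blocks = np.array_split(context, n_hosts, axis=0)
    kv_caches = []
    for block in blocks:
        q, k, v = block @ Wq, block @ Wk, block @ Wv   # per-block projections
        _ = softmax(q @ k.T / np.sqrt(d)) @ v          # local attention (no cross-block access)
        kv_caches.append((k, v))                       # cache K/V for phase 2
    return kv_caches

def phase2_query_attention(query_tok, kv_caches, d):
    """Phase 2: the query attends globally to all hosts' cached KV.
    Each host returns a partial output plus its log-sum-exp; the query
    host reweights and sums them into exact global softmax attention."""
    q = query_tok @ Wq
    partials, lses = [], []
    for k, v in kv_caches:
        scores = q @ k.T / np.sqrt(d)            # (1, block_len)
        lses.append(np.log(np.exp(scores).sum()))  # local log-sum-exp
        partials.append(softmax(scores) @ v)       # local attention output
    lses = np.array(lses)
    weights = np.exp(lses - lses.max())
    weights = weights / weights.sum()              # global softmax correction
    return sum(w * p for w, p in zip(weights, partials))

rng = np.random.default_rng(0)
d = 16
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
context = rng.normal(size=(128, d))    # toy "long" context
query_tok = rng.normal(size=(1, d))    # single query token
caches = phase1_context_encoding(context, n_hosts=4, d=d)
out = phase2_query_attention(query_tok, caches, d=d)
print(out.shape)  # (1, 16)
```

Because the per-host weights are just the normalized exponentials of the local log-sum-exp values, the phase-2 result matches full attention over the concatenated context while each host only ever sees its own block.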

Star Attention: Efficient LLM Inference over Long Sequences
Inference with Transformer-based Large Language Models (LLMs) on long sequences is both costly and slow due to the quadratic complexity of the self-attention mechanism. We introduce Star...
https://arxiv.org/abs/2411.17116

