Parallel Context Windows (ACL 2023)
Context is divided into multiple independent pieces; attention across pieces is prohibited during parallel encoding, and the pieces are integrated only at the query stage, where query tokens attend to all of them.
https://arxiv.org/pdf/2212.10947
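A minimal sketch of the attention pattern described above, as a PyTorch boolean mask where True means attention is allowed; `parallel_context_mask`, `piece_lens`, and `query_len` are illustrative names, not from the paper.

```python
import torch

def parallel_context_mask(piece_lens, query_len):
    """Mask for parallel context encoding: context pieces attend only
    within themselves (block-diagonal, causal inside each piece), while
    query tokens attend to every piece and causally to earlier query tokens."""
    total_ctx = sum(piece_lens)
    n = total_ctx + query_len
    mask = torch.zeros(n, n, dtype=torch.bool)  # True = attention allowed

    # Block-diagonal self-attention within each context piece
    start = 0
    for plen in piece_lens:
        mask[start:start + plen, start:start + plen] = torch.ones(plen, plen).tril().bool()
        start += plen

    # Query tokens see all context pieces, plus causal attention among themselves
    mask[total_ctx:, :total_ctx] = True
    mask[total_ctx:, total_ctx:] = torch.ones(query_len, query_len).tril().bool()
    return mask

# Example: three pieces of length 4, 5, 3 and a 6-token query
m = parallel_context_mask([4, 5, 3], 6)
print(m.shape)  # torch.Size([18, 18])
```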
CEPE (Context Expansion with Parallel Encoding)
Context is divided into multiple blocks; each block is encoded independently by a small encoder, and the decoder attends to all encoded blocks only at query time via cross-attention.

https://arxiv.org/pdf/2402.16617
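A rough sketch of this integration step, not CEPE's exact architecture (the paper trains a small encoder and inserts cross-attention into the decoder's layers); here one encoder layer and one cross-attention module stand in for the idea, and all names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class ChunkCrossAttention(nn.Module):
    """Encode each context block independently, then let query states
    attend to the concatenated block encodings via cross-attention."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.encoder_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, blocks, query_states):
        # Independent encoding: no attention across blocks
        encoded = [self.encoder_layer(b) for b in blocks]   # each (1, L_i, d)
        memory = torch.cat(encoded, dim=1)                  # (1, sum L_i, d)
        # Query tokens attend over all encoded blocks at once
        out, _ = self.cross_attn(query_states, memory, memory)
        return out

# Example: three 128-token blocks and a 16-token query
blocks = [torch.randn(1, 128, 512) for _ in range(3)]
query = torch.randn(1, 16, 512)
print(ChunkCrossAttention()(blocks, query).shape)  # torch.Size([1, 16, 512])
```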
Attention Entropy
Applying parallel context encoding directly to full-attention LLMs causes significant performance degradation. Abnormally high attention entropy appears on query tokens and correlates strongly with the degradation. Parallel encoding increases entropy through key/logit scale instability and multiple attention sinks. Adding a shared attention sink and selective attention (top-K block selection) reduces entropy and substantially narrows the performance gap.
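A small helper for the entropy quantity referenced here, computed per query token over post-softmax attention probabilities; the function name and toy rows are illustrative.

```python
import torch

def attention_entropy(attn_weights):
    """Per-query-token attention entropy H = -sum_j p_ij * log p_ij.

    attn_weights: (..., n_query, n_key) rows that already sum to 1.
    Higher values mean attention is spread over more keys; the paper
    links abnormally high entropy on query tokens to the degradation
    seen under parallel encoding."""
    p = attn_weights.clamp_min(1e-12)
    return -(p * p.log()).sum(dim=-1)

# Toy example: a peaked vs. a near-uniform attention row
peaked = torch.tensor([[0.9, 0.05, 0.03, 0.02]])
uniform = torch.full((1, 4), 0.25)
print(attention_entropy(peaked), attention_entropy(uniform))  # low vs. ~log(4)
```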
The context is split into P sub-pieces; during encoding each piece performs self-attention only within itself (inter-piece attention prohibited), and only query tokens attend across pieces. If the query attends to all pieces, attention can spread too widely (entropy↑), so only a few "promising" pieces are kept at the piece level and the rest are excluded from attention. However, selective attention inherently involves information loss: if the answer is in a discarded block, recall drops to zero, and multi-document reasoning that requires combining information across blocks suffers.
https://arxiv.org/pdf/2412.16545
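A toy sketch, under assumed simplifications, of the two mitigations above: a shared attention sink prepended to the keys, and top-K block selection that drops low-scoring blocks. The scoring rule (mean query state vs. mean block key similarity) and all names are illustrative, not the paper's exact method.

```python
import torch

def select_top_k_blocks(query_states, block_keys, k=2):
    """Score each block by the similarity between the mean query state and
    the block's mean key, then keep only the k best-scoring blocks."""
    q = query_states.mean(dim=0)                                  # (d,)
    scores = torch.stack([q @ bk.mean(dim=0) for bk in block_keys])
    top = torch.topk(scores, k=min(k, len(block_keys))).indices
    return sorted(top.tolist())

def build_selective_keys(sink_keys, block_keys, selected):
    """Prepend a shared attention sink and concatenate only the selected
    blocks, so query attention stays concentrated (lower entropy)."""
    kept = [block_keys[i] for i in selected]
    return torch.cat([sink_keys] + kept, dim=0)

# Toy usage: 4 blocks of keys, a shared 4-token sink, keep the 2 most relevant blocks
d = 64
blocks = [torch.randn(32, d) for _ in range(4)]
sink = torch.randn(4, d)
q_states = torch.randn(8, d)
sel = select_top_k_blocks(q_states, blocks, k=2)
keys = build_selective_keys(sink, blocks, sel)
print(sel, keys.shape)  # e.g. [0, 3] torch.Size([68, 64])
```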

Seonglae Cho