Parallel Context Windows (ACL 2023)
Context is divided into multiple independent pieces; attention across pieces is prohibited during parallel encoding, and the pieces are integrated only at the query stage, where query tokens attend to all of them.
https://arxiv.org/pdf/2212.10947
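A minimal sketch of the attention pattern described above, as a PyTorch boolean mask where True means attention is allowed; `parallel_context_mask`, `piece_lens`, and `query_len` are illustrative names, not from the paper.

```python
import torch

def parallel_context_mask(piece_lens, query_len):
    """Mask for parallel context encoding: context pieces attend only
    within themselves (block-diagonal, causal inside each piece), while
    query tokens attend to every piece and causally to earlier query tokens."""
    total_ctx = sum(piece_lens)
    n = total_ctx + query_len
    mask = torch.zeros(n, n, dtype=torch.bool)  # True = attention allowed

    # Block-diagonal self-attention within each context piece
    start = 0
    for plen in piece_lens:
        mask[start:start + plen, start:start + plen] = torch.ones(plen, plen).tril().bool()
        start += plen

    # Query tokens see all context pieces, plus causal attention among themselves
    mask[total_ctx:, :total_ctx] = True
    mask[total_ctx:, total_ctx:] = torch.ones(query_len, query_len).tril().bool()
    return mask

# Example: three pieces of length 4, 5, 3 and a 6-token query
m = parallel_context_mask([4, 5, 3], 6)
print(m.shape)  # torch.Size([18, 18])
```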
CEPE (Context Expansion with Parallel Encoding)
Context is divided into multiple blocks; each block is encoded independently by a small encoder, and the decoder attends to all encoded blocks only at query time via cross-attention.

https://arxiv.org/pdf/2402.16617
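A rough sketch of this integration step, not CEPE's exact architecture (the paper trains a small encoder and inserts cross-attention into the decoder's layers); here one encoder layer and one cross-attention module stand in for the idea, and all names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class ChunkCrossAttention(nn.Module):
    """Encode each context block independently, then let query states
    attend to the concatenated block encodings via cross-attention."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.encoder_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, blocks, query_states):
        # Independent encoding: no attention across blocks
        encoded = [self.encoder_layer(b) for b in blocks]   # each (1, L_i, d)
        memory = torch.cat(encoded, dim=1)                  # (1, sum L_i, d)
        # Query tokens attend over all encoded blocks at once
        out, _ = self.cross_attn(query_states, memory, memory)
        return out

# Example: three 128-token blocks and a 16-token query
blocks = [torch.randn(1, 128, 512) for _ in range(3)]
query = torch.randn(1, 16, 512)
print(ChunkCrossAttention()(blocks, query).shape)  # torch.Size([1, 16, 512])
```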
Attention Entropy
Applying parallel context encoding directly to full-attention LLMs causes significant performance degradation. Abnormally high attention entropy appears on query tokens and correlates strongly with the degradation. Parallel encoding increases entropy through key/logit scale instability and multiple attention sinks. Adding a shared attention sink and selective attention (top-K block selection) reduces entropy and substantially narrows the performance gap.
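A small helper for the entropy quantity referenced here, computed per query token over post-softmax attention probabilities; the function name and toy rows are illustrative.

```python
import torch

def attention_entropy(attn_weights):
    """Per-query-token attention entropy H = -sum_j p_ij * log p_ij.

    attn_weights: (..., n_query, n_key) rows that already sum to 1.
    Higher values mean attention is spread over more keys; the paper
    links abnormally high entropy on query tokens to the degradation
    seen under parallel encoding."""
    p = attn_weights.clamp_min(1e-12)
    return -(p * p.log()).sum(dim=-1)

# Toy example: a peaked vs. a near-uniform attention row
peaked = torch.tensor([[0.9, 0.05, 0.03, 0.02]])
uniform = torch.full((1, 4), 0.25)
print(attention_entropy(peaked), attention_entropy(uniform))  # low vs. ~log(4)
```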
The context is split into P sub-pieces; during encoding each piece performs self-attention only within itself (inter-piece attention prohibited), and only query tokens attend across pieces. If the query attends to all pieces, attention can spread too widely (entropy↑), so only a few "promising" pieces are kept at the piece level and the rest are excluded from attention. However, selective attention inherently involves information loss: if the answer is in a discarded block, recall drops to zero, and multi-document reasoning that requires combining information across blocks suffers.
https://arxiv.org/pdf/2412.16545
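A toy sketch, under assumed simplifications, of the two mitigations above: a shared attention sink prepended to the keys, and top-K block selection that drops low-scoring blocks. The scoring rule (mean query state vs. mean block key similarity) and all names are illustrative, not the paper's exact method.

```python
import torch

def select_top_k_blocks(query_states, block_keys, k=2):
    """Score each block by the similarity between the mean query state and
    the block's mean key, then keep only the k best-scoring blocks."""
    q = query_states.mean(dim=0)                                  # (d,)
    scores = torch.stack([q @ bk.mean(dim=0) for bk in block_keys])
    top = torch.topk(scores, k=min(k, len(block_keys))).indices
    return sorted(top.tolist())

def build_selective_keys(sink_keys, block_keys, selected):
    """Prepend a shared attention sink and concatenate only the selected
    blocks, so query attention stays concentrated (lower entropy)."""
    kept = [block_keys[i] for i in selected]
    return torch.cat([sink_keys] + kept, dim=0)

# Toy usage: 4 blocks of keys, a shared 4-token sink, keep the 2 most relevant blocks
d = 64
blocks = [torch.randn(32, d) for _ in range(4)]
sink = torch.randn(4, d)
q_states = torch.randn(8, d)
sel = select_top_k_blocks(q_states, blocks, k=2)
keys = build_selective_keys(sink, blocks, sel)
print(sel, keys.shape)  # e.g. [0, 3] torch.Size([68, 64])
```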

Seonglae Cho