ACL 2023
Context is divided into multiple independent pieces, with attention prohibited between pieces during parallel encoding, then integrated only at the query stage.
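A minimal sketch of the attention mask this implies (shapes and names are illustrative, not taken from any specific paper): context tokens attend only within their own piece, while query tokens attend to all pieces and causally to earlier query tokens.

```python
import torch

def parallel_encoding_mask(piece_lens, query_len):
    """Boolean mask (True = attention allowed) for parallel context encoding.

    Context is split into pieces of lengths `piece_lens`; each context token
    attends only within its own piece, while query tokens attend to all
    context tokens and causally to the query prefix.
    """
    ctx_len = sum(piece_lens)
    total = ctx_len + query_len
    mask = torch.zeros(total, total, dtype=torch.bool)

    # Block-diagonal attention inside each piece; no cross-piece attention.
    # (Assumption: causal attention within each piece, as in a decoder-only LLM.)
    start = 0
    for length in piece_lens:
        mask[start:start + length, start:start + length] = torch.tril(
            torch.ones(length, length, dtype=torch.bool)
        )
        start += length

    # Query tokens attend to every context token and causally to the query prefix.
    mask[ctx_len:, :ctx_len] = True
    mask[ctx_len:, ctx_len:] = torch.tril(torch.ones(query_len, query_len, dtype=torch.bool))
    return mask

# Example: three pieces of lengths 4, 5, 3 and a 6-token query.
print(parallel_encoding_mask([4, 5, 3], 6).int())
```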
CEPE (Context Expansion with Parallel Encoding)
Context is divided into multiple blocks, each encoded independently; the query then attends to all blocks at once via cross-attention.
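A rough PyTorch sketch of the encode-independently / cross-attend-at-query-time idea. The module names, sizes, and layer counts are placeholders, not CEPE's actual architecture; it only illustrates that blocks never see each other during encoding and are merged only through the query's cross-attention.

```python
import torch
import torch.nn as nn

class BlockEncoderWithCrossAttention(nn.Module):
    """Toy sketch: encode context blocks independently, then let the query
    hidden states attend over the concatenated block representations."""

    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.block_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, blocks, query_hidden):
        # blocks: list of (1, block_len, d_model); query_hidden: (1, q_len, d_model)
        # 1) Each block is encoded with no attention across blocks.
        encoded = [self.block_encoder(b) for b in blocks]
        memory = torch.cat(encoded, dim=1)          # (1, total_ctx_len, d_model)
        # 2) Only the query cross-attends over all encoded blocks.
        out, _ = self.cross_attn(query_hidden, memory, memory)
        return out

model = BlockEncoderWithCrossAttention()
blocks = [torch.randn(1, n, 256) for n in (32, 48, 16)]
query = torch.randn(1, 8, 256)
print(model(blocks, query).shape)  # torch.Size([1, 8, 256])
```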

Attention Entropy
Applying parallel context encoding directly to full-attention LLMs causes significant performance degradation. Query tokens exhibit abnormally high attention entropy, which correlates strongly with the degradation. Parallel encoding raises entropy through key/logit scale instability and the presence of multiple attention sinks. Adding a shared attention sink and selective attention (top-K block selection) reduces entropy and substantially closes the performance gap.
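A small sketch of the entropy diagnostic (illustrative code, not the paper's): the entropy of each query token's attention distribution. Attention spread thinly across many parallel pieces gives high entropy; attention concentrated on a few keys (or a shared sink) gives low entropy.

```python
import torch

def attention_entropy(attn_weights, eps=1e-9):
    """Entropy of each query token's attention distribution.

    attn_weights: (..., q_len, kv_len), rows sum to 1 after softmax.
    Higher entropy = attention spread thinly over many keys.
    """
    p = attn_weights.clamp_min(eps)
    return -(p * p.log()).sum(dim=-1)

# Peaked attention (low entropy) vs. near-uniform attention over many pieces (high entropy).
peaked = torch.softmax(torch.tensor([[8.0, 0.0, 0.0, 0.0]]), dim=-1)
spread = torch.full((1, 1024), 1.0 / 1024)
print(attention_entropy(peaked))  # small, close to 0 nats
print(attention_entropy(spread))  # ~6.93 nats (log 1024)
```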
Context is split into P pieces; during encoding each piece performs self-attention only within itself (inter-piece attention is prohibited), and only the query tokens attend to all pieces. If the query attends to every piece, attention can spread too widely (entropy↑), so only a few "promising" pieces are kept at the piece level and the rest are excluded from attention. Selective attention, however, inherently loses information: if the answer sits in a discarded block, recall drops to zero, and multi-document reasoning that must combine information across blocks suffers.
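A hedged sketch of top-K block selection. The scoring rule here (a block's maximum query-key logit) is one plausible choice, not necessarily the one used in the paper; the point is only that blocks outside the top-K are removed from attention entirely.

```python
import torch

def select_top_k_blocks(query, block_keys, k=2):
    """Keep only the k most promising blocks before query-time attention.

    query:      (q_len, d) query states
    block_keys: list of (block_len, d) key matrices, one per block
    Scoring rule (assumption): a block's score is the max scaled query-key logit it produces.
    """
    d = query.shape[-1]
    scores = torch.stack([
        (query @ keys.T / d**0.5).max() for keys in block_keys
    ])
    top = torch.topk(scores, k=min(k, len(block_keys))).indices.tolist()
    # Blocks outside the top-k are excluded from attention entirely;
    # if the answer lives in a dropped block, it cannot be recovered.
    return sorted(top)

query = torch.randn(8, 64)
blocks = [torch.randn(32, 64) for _ in range(5)]
print(select_top_k_blocks(query, blocks, k=2))  # e.g. [1, 4]
```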
