Parallel Context Encoding

ACL 2023

The context is divided into multiple independent pieces, attention between pieces is prohibited during parallel encoding, and the pieces are integrated only at the query stage.
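A minimal sketch (my own illustration, not any paper's code) of the attention mask this implies: each context piece attends causally only within itself, while query tokens attend to every piece and causally to themselves.

```python
import torch

def parallel_context_mask(piece_lengths, query_len):
    """Boolean attention mask: True = attention allowed (rows = queries, cols = keys)."""
    ctx_len = sum(piece_lengths)
    total = ctx_len + query_len
    mask = torch.zeros(total, total, dtype=torch.bool)

    # Each context piece attends only within itself (causal, block-diagonal).
    start = 0
    for length in piece_lengths:
        block = torch.tril(torch.ones(length, length, dtype=torch.bool))
        mask[start:start + length, start:start + length] = block
        start += length

    # Query tokens attend to all context pieces and causally to themselves.
    mask[ctx_len:, :ctx_len] = True
    mask[ctx_len:, ctx_len:] = torch.tril(
        torch.ones(query_len, query_len, dtype=torch.bool))
    return mask

# Example: three pieces of 4 tokens each, followed by a 2-token query.
print(parallel_context_mask([4, 4, 4], 2).int())
```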

CEPE (Context Expansion with Parallel Encoding)

The context is divided into multiple blocks, each block is encoded independently, and the query attends to all blocks only at query time via cross-attention.
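A minimal sketch of this idea, assuming an off-the-shelf encoder layer and a single cross-attention module rather than CEPE's actual architecture:

```python
import torch
import torch.nn as nn

d_model, n_heads = 64, 4
encoder = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

# Three context blocks of 16 tokens each, encoded independently of one another.
blocks = [torch.randn(1, 16, d_model) for _ in range(3)]
encoded = torch.cat([encoder(b) for b in blocks], dim=1)  # (1, 48, d_model)

# Query hidden states gather information from all block encodings only now.
query_states = torch.randn(1, 8, d_model)
fused, _ = cross_attn(query_states, encoded, encoded)
print(fused.shape)  # torch.Size([1, 8, 64])
```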

Attention Entropy

Applying parallel context encoding directly to full-attention LLMs causes significant performance degradation. Query tokens exhibit abnormally high attention entropy, which correlates strongly with the degradation. Parallel encoding raises entropy through key/logit scale instability and the presence of multiple attention sinks. Adding a shared attention sink and selective attention (top-K block selection) lowers entropy and substantially closes the performance gap.
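A minimal sketch of the entropy diagnostic (my own formulation): the Shannon entropy of each query token's attention distribution over all keys, where flatter, more spread-out distributions yield higher values.

```python
import torch

def attention_entropy(attn_weights, eps=1e-9):
    """attn_weights: (num_queries, num_keys), each row sums to 1."""
    p = attn_weights.clamp_min(eps)
    return -(p * p.log()).sum(dim=-1)

logits = torch.randn(8, 48)             # query-to-key attention logits
probs = torch.softmax(logits, dim=-1)
print(attention_entropy(probs).mean())  # average entropy over query tokens
```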
If the query attends to all pieces, attention can spread too widely (entropy↑), so only a few "promising" pieces are kept at the piece level and the rest are excluded from attention, as sketched below. Concretely, the context is split into P pieces; during encoding, each piece performs self-attention only within itself (inter-piece attention is prohibited), and only the query tokens attend across pieces. Selective attention, however, inherently involves information loss: if the answer lies in a discarded block, recall drops to zero, and multi-document reasoning that requires combining information across blocks suffers.
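A minimal sketch of piece-level top-K selection; the relevance score here (mean query-key dot product per block) is an illustrative assumption, not necessarily the selection criterion used in the paper.

```python
import torch

def select_top_k_blocks(query_states, block_keys, k):
    """query_states: (q, d); block_keys: list of (len_i, d) tensors."""
    # One relevance score per block: mean similarity between queries and the block's keys.
    scores = torch.stack([(query_states @ keys.T).mean() for keys in block_keys])
    top = torch.topk(scores, k=min(k, len(block_keys))).indices
    return [block_keys[i] for i in top.tolist()], top

q = torch.randn(4, 32)
blocks = [torch.randn(16, 32) for _ in range(8)]
kept, idx = select_top_k_blocks(q, blocks, k=2)
print(idx, [b.shape for b in kept])
```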