Conditional Memory via Scalable Lookup
A New Axis of Sparsity for Large Language Models
Engram
deepseek-ai • Updated 2026 Jan 13 19:07

To address the inefficiency of knowledge lookup that MoE (conditional computation) alone cannot solve, we propose a new axis of sparsity: Conditional Memory. Existing Transformers lack a lookup primitive, so they must inefficiently reconstruct static knowledge through multiple layers of computation. Engram instead retrieves static knowledge directly via N-gram-based O(1) hash lookup, then adapts it to the current context with context-aware gating.
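A minimal sketch of that idea, not the paper's implementation: the module name, rolling hash, table size, and gating form below are all assumptions, but the structure (hash the trailing N-gram to a bucket, look up a static memory row, gate it against the current hidden state) follows the description above.

```python
import torch
import torch.nn as nn


class EngramLookup(nn.Module):
    """Hypothetical sketch: N-gram hash lookup + context-aware gating."""

    def __init__(self, d_model: int, num_buckets: int = 1 << 20, ngram: int = 3):
        super().__init__()
        self.ngram = ngram
        self.num_buckets = num_buckets
        # Large static memory table addressed by the N-gram hash.
        self.memory = nn.Embedding(num_buckets, d_model)
        # Gate decides how much retrieved memory to mix into the hidden state.
        self.gate = nn.Linear(2 * d_model, d_model)

    def _hash_ngrams(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq). Hash each position's trailing N-gram to a bucket id.
        batch, seq = token_ids.shape
        padded = nn.functional.pad(token_ids, (self.ngram - 1, 0))
        bucket = torch.zeros(batch, seq, dtype=torch.long, device=token_ids.device)
        for i in range(self.ngram):
            # Simple polynomial rolling hash; deterministic, O(1) per position.
            bucket = (bucket * 1000003 + padded[:, i : i + seq]) % self.num_buckets
        return bucket

    def forward(self, token_ids: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model) hidden states at this layer.
        retrieved = self.memory(self._hash_ngrams(token_ids))   # O(1) table lookup
        g = torch.sigmoid(self.gate(torch.cat([hidden, retrieved], dim=-1)))
        return hidden + g * retrieved                            # context-aware gating
```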

Results show a Sparsity Allocation Law: there is a U-shaped optimum in how the sparse parameter budget is split between MoE and Engram, with performance peaking when roughly 20-25% of the budget is allocated to Engram rather than pure MoE. Early layers are relieved of static pattern processing, freeing more depth for reasoning. Long-context performance improves significantly (RULER, LongPPL). Deterministic lookup enables offloading the memory to CPU/host RAM, with a 100B-scale memory incurring <3% overhead.
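The offloading works because bucket ids are a pure function of the input token ids, so the needed rows can be fetched from host memory before the layer that consumes them runs. A rough sketch under assumed sizes and function names (not the paper's system):

```python
import torch

# Illustrative sizes only: ~1M buckets x 64-dim rows kept in pinned host memory.
host_table = torch.randn(1 << 20, 64).pin_memory()


def prefetch(bucket_ids: torch.Tensor, device: str = "cuda") -> torch.Tensor:
    """Gather the rows this batch needs from host memory and ship them to the GPU.

    bucket_ids is computed deterministically from the token ids, so this call can
    be issued ahead of the consuming layer, hiding the host-to-device transfer.
    """
    rows = host_table.index_select(0, bucket_ids.cpu().flatten())
    # In practice the gather would land in a pinned staging buffer so that
    # non_blocking=True actually overlaps the copy with GPU compute.
    return rows.to(device, non_blocking=True).view(*bucket_ids.shape, -1)
```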

Unlike a Memory Network, Engram has no computation overhead and relies on O(1) deterministic hash lookup; the memory can be scaled massively and integrated easily with modern Transformers and MoE.
https://www.arxiv.org/pdf/2601.07372

Seonglae Cho