Type: PDF Document
Original Link:
Publication Date: 2026-01-27
Author: Xin Cheng; Wangding Zeng; Damai Dai; Qinyu Chen; Bingxuan Wang; Zhenda Xie; Kezhao Huang; Xingkai Yu; Zhewen Hao; Yukun Li; Han Zhang; Huishuai Zhang; Dongyan Zhao; Wenfeng Liang
Summary #
WHAT: Engram is a conditional memory module that modernizes classic N-gram embeddings for O(1) lookup and integrates into large language models (LLMs) to handle static knowledge and local dependencies more efficiently.
WHY: Engram addresses the inefficiency of Transformer models that simulate static knowledge retrieval through layered computation, offering a new axis of sparsity complementary to the conditional computation paradigm (MoE). This improves performance across domains including knowledge retrieval, general reasoning, coding, and math.
WHO: Key players include researchers and engineers from DeepSeek-AI and Peking University, who developed Engram, and the AI research community studying and implementing advanced language models.
WHERE: Engram targets the large language model (LLM) space, integrating with existing architectures such as Mixture-of-Experts (MoE) to improve efficiency and performance.
WHEN: Engram is an emerging technique attracting attention for its potential to improve language model performance; it is still in the development phase, with ongoing studies and implementations.
BUSINESS IMPACT:
- Opportunities: Engram can be integrated into the existing stack to improve language model performance, reducing computational costs and enhancing knowledge retrieval efficiency.
- Risks: Competing conditional memory technologies and shifts toward other language model architectures could limit adoption.
- Integration: Engram can be easily integrated with existing MoE architectures, offering immediate performance improvements without the need to completely re-train models.
TECHNICAL SUMMARY:
- Core Technology Stack: Engram uses modernized N-gram embeddings, tokenizer compression, multi-head hashing, contextualized gating, and multi-branch integration (a lookup sketch follows this list). The model is implemented in Python on deep learning frameworks such as PyTorch.
- Scalability and Architectural Limits: Engram scales to very large models (a 175B-parameter configuration is reported), and its efficiency is demonstrated in large-scale pre-training and inference scenarios.
- Key Technical Differentiators: Engram offers O(1) lookup for static patterns, reduces the computational depth required for knowledge retrieval, and frees attention capacity for global context. Its infrastructure efficiency allows for asynchronous prefetching of embeddings, reducing communication overhead.
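Below is a minimal sketch of how a multi-head hashed N-gram lookup can provide O(1) retrieval, assuming PyTorch; the class name, table size, N-gram order, and hash scheme are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HashedNgramLookup(nn.Module):
    """O(1) retrieval sketch: each position hashes its trailing N-gram into
    fixed-size embedding tables through several independent hash heads."""

    def __init__(self, table_size=1_000_000, dim=256, ngram=3, num_heads=4):
        super().__init__()
        self.ngram = ngram
        self.table_size = table_size
        # One embedding table per hash head; hash collisions are tolerated
        # and averaged out across heads.
        self.tables = nn.ModuleList(
            nn.Embedding(table_size, dim) for _ in range(num_heads)
        )
        # Fixed random multipliers act as cheap hash seeds, so the mapping is
        # deterministic once the module is created (illustrative only).
        self.register_buffer(
            "seeds", torch.randint(1, 2**31 - 1, (num_heads, ngram))
        )

    def forward(self, token_ids):  # token_ids: (batch, seq_len), int64
        # Left-pad so every position has a full trailing N-gram.
        padded = F.pad(token_ids, (self.ngram - 1, 0))
        grams = padded.unfold(1, self.ngram, 1)        # (batch, seq_len, ngram)
        out = 0
        for head, table in enumerate(self.tables):
            # Deterministic hash of the N-gram -> index into this head's table.
            idx = (grams * self.seeds[head]).sum(-1) % self.table_size
            out = out + table(idx)                     # (batch, seq_len, dim)
        return out / len(self.tables)
```

Because the hash depends only on token ids, the same local context always retrieves the same memory rows, and the cost per position is constant regardless of model depth.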
Technical Details:
- Engram Pipeline: The Engram pipeline has two main phases: retrieval and fusion. In the retrieval phase, local contexts are mapped to static memory entries via deterministic hashing. In the fusion phase, the retrieved embeddings are dynamically modulated by the current hidden state and refined through a light convolution (see the fusion sketch after this list).
- Application Examples:
- Knowledge Retrieval: Engram improves knowledge retrieval in benchmarks like MMLU, CMMLU, and MMLU-Pro.
- General Reasoning: Shows significant gains in general reasoning benchmarks like BBH, ARC-Challenge, and DROP.
- Coding and Math: Improves performance in coding and math benchmarks like HumanEval, MATH, and GSM8K.
- Long Context: Enhances retrieval and reasoning capabilities in long contexts, as demonstrated in benchmarks like LongPPL and RULER.
- Usage Examples:
- Pre-training: Engram has been used in large-scale pre-training, where the resulting Engram models show significant improvements over MoE baselines.
- Inference: During inference, Engram allows asynchronous prefetching of embeddings, reducing communication overhead and improving efficiency (see the prefetch sketch after this list).
- Gating Visualization: The visualization of Engram’s gating mechanism shows that the module effectively identifies and retrieves stereotypical linguistic patterns, such as multi-token entities and formulaic phrases.
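As a complement to the retrieval sketch above, here is a minimal, hypothetical sketch of the fusion phase under the same assumptions (PyTorch, illustrative names and sizes): the retrieved memory embedding is modulated by a gate computed from the current hidden state, refined by a lightweight depthwise convolution, and added back to the residual stream.

```python
import torch
import torch.nn as nn


class GatedMemoryFusion(nn.Module):
    """Hypothetical fusion step: gate the retrieved memory with the current
    hidden state, refine it with a light depthwise convolution, and add it
    back to the residual stream."""

    def __init__(self, dim=256, kernel_size=3):
        super().__init__()
        self.gate_proj = nn.Linear(dim, dim)    # gate computed from hidden state
        self.mem_proj = nn.Linear(dim, dim)     # projects the retrieved memory
        # Depthwise (per-channel) convolution over the sequence dimension.
        self.dw_conv = nn.Conv1d(
            dim, dim, kernel_size, padding=kernel_size // 2, groups=dim
        )

    def forward(self, hidden, memory):           # both (batch, seq_len, dim)
        gate = torch.sigmoid(self.gate_proj(hidden))  # contextual gate in [0, 1]
        fused = gate * self.mem_proj(memory)           # modulate the memory entry
        # Light convolutional refinement mixes nearby positions per channel.
        fused = self.dw_conv(fused.transpose(1, 2)).transpose(1, 2)
        return hidden + fused                          # residual injection
```

A toy forward pass might chain the two sketches as `fusion(hidden, lookup(token_ids))`, where `hidden` comes from the Transformer block and `token_ids` are the raw tokens.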
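Because the hashed indices depend only on token ids, not on hidden states, the embedding rows can be fetched before the layer that consumes them. The sketch below shows one way to overlap that transfer with computation in PyTorch using a side CUDA stream; the function name and buffering scheme are assumptions, not the paper's actual serving infrastructure.

```python
import torch


def prefetch_memory_rows(cpu_table, indices, device="cuda"):
    """Copy only the embedding rows needed by the next step to the GPU on a
    side stream, overlapping the transfer with ongoing computation."""
    stream = torch.cuda.Stream()
    with torch.cuda.stream(stream):
        rows = cpu_table[indices].pin_memory()          # gather rows on the host
        gpu_rows = rows.to(device, non_blocking=True)   # asynchronous H2D copy
    return stream, gpu_rows


# Before the fusion step, the consumer makes the default stream wait so the
# prefetched rows are guaranteed to have arrived:
#   torch.cuda.current_stream().wait_stream(stream)
```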
Use Cases #
- Private AI Stack: Integration into proprietary pipelines
- Client Solutions: Implementation for client projects
Resources #
Original Links #
Article recommended and selected by the Human Technology eXcellence team, processed through artificial intelligence (in this case with LLM HTX-EU-Mistral3.1Small) on 2026-01-27 12:30. Original source: