visual-text compression) processes text through a Vision-Language Model (VLM), extending the LLM's context window by 3-4x. By rendering long text as images, a single visual token carries information from multiple text tokens, increasing information density and reducing computation.
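The 3-4x figure can be sanity-checked with back-of-envelope arithmetic: a rendered page is tiled into ViT patches, each patch becoming one visual token, while the same text would cost roughly one BPE token per four characters. The patch size, page size, and character count below are illustrative assumptions, not numbers from the source.

```python
# Back-of-envelope sketch of visual-token compression.
# All concrete numbers (28px patches, 1024x1024 page, ~20k chars/page,
# ~4 chars per text token) are hypothetical assumptions for illustration.

def visual_tokens(img_w: int, img_h: int, patch: int = 28) -> int:
    """Visual tokens for an image tiled into patch x patch squares."""
    return (img_w // patch) * (img_h // patch)

def text_tokens(n_chars: int, chars_per_token: float = 4.0) -> int:
    """Rough text-token estimate (~4 chars/token for English BPE)."""
    return int(n_chars / chars_per_token)

# Assume a dense 1024x1024 rendering holds ~20,000 characters.
v = visual_tokens(1024, 1024)   # 36 * 36 = 1296 visual tokens
t = text_tokens(20_000)         # ~5000 text tokens
print(f"visual={v}, text={t}, compression={t / v:.2f}x")
```

Under these assumptions the ratio lands near 3.9x, consistent with the 3-4x range quoted above; the real ratio depends on rendering density and tokenizer.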
Training pipeline
- Continual Pre-Training – Learning visual text understanding through OCR, interleaved text-image, and generation tasks
- LLM-based rendering exploration – Optimizing rendering parameters like DPI, font, and color using genetic algorithms
- Post-Training – Enhancing accuracy and reasoning style through SFT and GRPO-based RL
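The rendering-exploration step above can be sketched as a small genetic algorithm over rendering parameters. This is a minimal illustration, not the source's method: the parameter grid and the fitness function are stand-ins (a real system would score VLM accuracy on text rendered with each configuration, and the source uses an LLM to guide the search).

```python
# Genetic-algorithm sketch of rendering-parameter search (hypothetical
# parameter values and a proxy fitness; the real objective would be
# downstream VLM accuracy vs. compression).
import random

PARAMS = {"dpi": [72, 96, 120, 150], "font_size": [8, 10, 12], "line_gap": [0, 1, 2]}

def fitness(cfg):
    # Proxy objective: denser renderings (lower dpi / smaller font) compress
    # better, but overly dense text hurts legibility, so penalize extremes.
    density = (150 - cfg["dpi"]) + (12 - cfg["font_size"]) * 5 + (2 - cfg["line_gap"])
    legibility_penalty = max(0, 60 - cfg["dpi"] * cfg["font_size"] / 15)
    return density - legibility_penalty

def mutate(cfg):
    key = random.choice(list(PARAMS))
    return {**cfg, key: random.choice(PARAMS[key])}

def crossover(a, b):
    return {k: random.choice([a[k], b[k]]) for k in PARAMS}

def search(generations=30, pop_size=12, seed=0):
    random.seed(seed)
    pop = [{k: random.choice(v) for k, v in PARAMS.items()} for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        elite = pop[: pop_size // 2]  # keep the fittest half unchanged
        children = [mutate(crossover(random.choice(elite), random.choice(elite)))
                    for _ in range(pop_size - len(elite))]
        pop = elite + children
    return max(pop, key=fitness)

best = search()
print(best)
```

The elitism step keeps the best configurations across generations, so fitness never regresses; in the real pipeline, each candidate would be scored by rendering a corpus and evaluating the VLM on it.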
Unlike existing token-based expansion approaches such as Contextual Compression, this represents a novel direction, loosely analogous to how the human brain takes in written text visually rather than as discrete tokens.

Seonglae Cho