Glyph VLM

Creator
Seonglae Cho
Created
2025 Oct 31 0:35
Edited
2025 Oct 31 0:36
Refs
Glyph (visual-text compression) processes text through a Vision-Language Model (VLM), extending the LLM's effective context window by 3-4x. By rendering long text as images, a single visual token can carry the information of several text tokens, increasing information density and reducing computation.
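A back-of-envelope sketch of why rendering helps. All constants here (characters per page, patch grid, characters per BPE token) are illustrative assumptions, not numbers from Glyph:

```python
# Rough estimate of visual-text compression. A rendered page costs the VLM
# one token per image patch, regardless of how much text it contains.

def visual_tokens(n_chars: int, chars_per_line=80, lines_per_page=50,
                  patch_grid=(16, 16)) -> int:
    """Tokens a VLM spends on rendered pages: one token per image patch."""
    chars_per_page = chars_per_line * lines_per_page  # 4000 chars per page
    pages = -(-n_chars // chars_per_page)             # ceiling division
    return pages * patch_grid[0] * patch_grid[1]      # 256 patches per page

def text_tokens(n_chars: int, chars_per_token=4) -> int:
    """Rough BPE estimate: ~4 characters per text token."""
    return -(-n_chars // chars_per_token)

n = 400_000  # a long document
t, v = text_tokens(n), visual_tokens(n)
print(f"text tokens: {t}, visual tokens: {v}, ratio: {t / v:.1f}x")
# prints "text tokens: 100000, visual tokens: 25600, ratio: 3.9x"
```

Under these assumed constants the ratio lands near the 3-4x figure above; the real ratio depends on the rendering parameters the pipeline searches over.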

Training pipeline

  • Continual Pre-Training – Learning visual text understanding through OCR, interleaved, and generation tasks
  • LLM-based rendering exploration – Optimizing rendering parameters like DPI, font, and color using genetic algorithms
  • Post-Training – Enhancing accuracy and reasoning style through SFT and GRPO-based RL
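The rendering-exploration step can be sketched as a toy genetic algorithm over two parameters. The fitness function below is an assumed stand-in (it rewards dense packing while penalizing illegibly small text); in Glyph the score would come from model accuracy on the rendered pages, and every constant here is illustrative:

```python
import random

random.seed(0)

def fitness(dpi: int, font_pt: int) -> float:
    # Stand-in objective: assume ~9pt is comfortably legible, and that
    # higher DPI with smaller fonts packs more characters per image.
    legibility = max(0.0, 1 - abs(font_pt - 9) / 9)
    density = dpi / (font_pt * 12)
    return legibility * density

def mutate(ind):
    dpi, pt = ind  # small random perturbations, clamped to sane bounds
    return (min(300, max(72, dpi + random.randint(-20, 20))),
            min(18, max(4, pt + random.choice([-1, 0, 1]))))

# Random initial population of (dpi, font_pt) candidates
pop = [(random.randint(72, 300), random.randint(4, 18)) for _ in range(20)]
for _ in range(30):  # generations
    pop.sort(key=lambda ind: fitness(*ind), reverse=True)
    survivors = pop[:5]                                     # selection
    pop = survivors + [mutate(random.choice(survivors))     # mutation
                       for _ in range(15)]

best = max(pop, key=lambda ind: fitness(*ind))
print("best (dpi, font_pt):", best)
```

The same loop shape applies to the real search; Glyph additionally tunes parameters such as font family and color, and uses an LLM to guide the exploration.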
Unlike existing token-based context-extension approaches such as Contextual Compression, this represents a novel direction, closer to how the human brain reads text visually.