visual-text compression) processes text through a Vision-Language Model (VLM), extending the LLM's context window by 3-4x. By rendering long text as images, a single visual token carries information from multiple text tokens, increasing information density and reducing computation.
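The 3-4x figure can be sanity-checked with back-of-envelope arithmetic: a rendered page is tiled into ViT patches, each patch becoming one visual token, while the same text would cost roughly one BPE token per four characters. The patch size, page size, and character count below are illustrative assumptions, not numbers from the source.

```python
# Back-of-envelope sketch of visual-token compression.
# All concrete numbers (28px patches, 1024x1024 page, ~20k chars/page,
# ~4 chars per text token) are hypothetical assumptions for illustration.

def visual_tokens(img_w: int, img_h: int, patch: int = 28) -> int:
    """Visual tokens for an image tiled into patch x patch squares."""
    return (img_w // patch) * (img_h // patch)

def text_tokens(n_chars: int, chars_per_token: float = 4.0) -> int:
    """Rough text-token estimate (~4 chars/token for English BPE)."""
    return int(n_chars / chars_per_token)

# Assume a dense 1024x1024 rendering holds ~20,000 characters.
v = visual_tokens(1024, 1024)   # 36 * 36 = 1296 visual tokens
t = text_tokens(20_000)         # ~5000 text tokens
print(f"visual={v}, text={t}, compression={t / v:.2f}x")
```

Under these assumptions the ratio lands near 3.9x, consistent with the 3-4x range quoted above; the real ratio depends on rendering density and tokenizer.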
Training pipeline
- Continual Pre-Training – Learning visual text understanding through OCR, interleaved text-image, and generation tasks
- LLM-based rendering exploration – Optimizing rendering parameters like DPI, font, and color using genetic algorithms
- Post-Training – Enhancing accuracy and reasoning style through SFT and GRPO-based RL
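The rendering-exploration step above can be sketched as a small genetic algorithm over rendering parameters. This is a minimal illustration, not the source's method: the parameter grid and the fitness function are stand-ins (a real system would score VLM accuracy on text rendered with each configuration, and the source uses an LLM to guide the search).

```python
# Genetic-algorithm sketch of rendering-parameter search (hypothetical
# parameter values and a proxy fitness; the real objective would be
# downstream VLM accuracy vs. compression).
import random

PARAMS = {"dpi": [72, 96, 120, 150], "font_size": [8, 10, 12], "line_gap": [0, 1, 2]}

def fitness(cfg):
    # Proxy objective: denser renderings (lower dpi / smaller font) compress
    # better, but overly dense text hurts legibility, so penalize extremes.
    density = (150 - cfg["dpi"]) + (12 - cfg["font_size"]) * 5 + (2 - cfg["line_gap"])
    legibility_penalty = max(0, 60 - cfg["dpi"] * cfg["font_size"] / 15)
    return density - legibility_penalty

def mutate(cfg):
    key = random.choice(list(PARAMS))
    return {**cfg, key: random.choice(PARAMS[key])}

def crossover(a, b):
    return {k: random.choice([a[k], b[k]]) for k in PARAMS}

def search(generations=30, pop_size=12, seed=0):
    random.seed(seed)
    pop = [{k: random.choice(v) for k, v in PARAMS.items()} for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        elite = pop[: pop_size // 2]  # keep the fittest half unchanged
        children = [mutate(crossover(random.choice(elite), random.choice(elite)))
                    for _ in range(pop_size - len(elite))]
        pop = elite + children
    return max(pop, key=fitness)

best = search()
print(best)
```

The elitism step keeps the best configurations across generations, so fitness never regresses; in the real pipeline, each candidate would be scored by rendering a corpus and evaluating the VLM on it.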
Unlike existing token-based expansion approaches such as Contextual Compression, this represents a novel direction, loosely analogous to how the human brain takes in written text visually rather than as discrete tokens.

Seonglae Cho