The model architecture pairs a dedicated vision encoder (DeepEncoder) with a language decoder (DeepSeek-3B MoE). DeepEncoder operates in the following sequence: SAM-based local processing → 16× token compression → CLIP-based global context aggregation, converting a high-resolution image into very few vision tokens (e.g., a 1024×1024 document image is reduced to around 256 tokens instead of the 4096 raw patch tokens).
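The token counts above follow from simple arithmetic. A minimal sketch, assuming a ViT-style patch size of 16 (the 16× compression factor comes from the description above; the patch size is an assumption):

```python
PATCH = 16        # assumed SAM/ViT patch size
COMPRESSION = 16  # 16x token compressor from the DeepEncoder pipeline

def vision_tokens(h: int, w: int) -> tuple[int, int]:
    """Return (raw patch tokens, tokens after 16x compression)."""
    raw = (h // PATCH) * (w // PATCH)
    return raw, raw // COMPRESSION

print(vision_tokens(1024, 1024))  # (4096, 256)
```

So a 1024×1024 page that would naively cost 4096 vision tokens reaches the decoder as only 256.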
OCR
In compression-ratio vs. accuracy experiments, OCR restoration accuracy stays around 97% even at roughly 9–10× compression. The paper demonstrates that such compression is feasible by measuring OCR capability directly. Compared to the Glyph VLM, this is a more directly practical approach.
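Here the compression ratio is the number of ground-truth text tokens relative to the vision tokens fed to the decoder. A small sketch with illustrative numbers (not figures from the paper):

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Text tokens represented per vision token (illustrative definition)."""
    return text_tokens / vision_tokens

# e.g., a page decoding to ~2560 text tokens from 256 vision tokens
print(compression_ratio(2560, 256))  # 10.0
```

At ratios up to about 10×, accuracy reportedly holds near 97%; beyond that, it degrades.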
https://www.arxiv.org/pdf/2510.18234v1
deepseek-ai/DeepSeek-OCR · Hugging Face
https://huggingface.co/deepseek-ai/DeepSeek-OCR

Seonglae Cho