DeepSeek-OCR

The model pairs a dedicated vision encoder (DeepEncoder) with a language decoder (DeepSeek-3B MoE). DeepEncoder runs in three stages: SAM-based local processing → 16x token compression → CLIP-based global context modeling. This converts a high-resolution image into very few vision tokens (e.g., a 1024×1024 document image is reduced to around 256 vision tokens instead of 4096).
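The token counts in the example above can be reproduced with a little arithmetic. The sketch below is only an illustration, not the model's code: the function name `vision_token_count` is hypothetical, and the 16-pixel patch size is an assumption inferred from the 4096-token figure (64 × 64 patches for a 1024×1024 image).

```python
def vision_token_count(height, width, patch_size=16, compression=16):
    """Return (patch_tokens, compressed_tokens) for one image.

    patch_size=16 is an assumption inferred from the 4096-token figure;
    compression=16 is the 16x token compression stage described in the note.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    # One token per non-overlapping patch before compression.
    patch_tokens = (height // patch_size) * (width // patch_size)
    if patch_tokens % compression:
        raise ValueError("patch token count must be divisible by compression")
    # The compression stage keeps 1 token out of every `compression` tokens.
    return patch_tokens, patch_tokens // compression


patches, compressed = vision_token_count(1024, 1024)
print(patches, compressed)  # 4096 256
```

Under these assumptions, the expensive global-context stage only ever sees the compressed 256 tokens, which is presumably why the compression sits between the local and global stages.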

OCR

In the compression-ratio vs. accuracy experiments, OCR restoration accuracy stays at around 97% even at approximately 9-10x compression. By demonstrating OCR capability, the paper shows directly that such compression is feasible, in contrast to Glyph VLM, which took a more directly practical approach.
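As a rough illustration of what "9-10x compression" means, the sketch below reads the compression ratio as the number of text tokens on a page divided by the number of vision tokens used to represent it. That reading is an assumption of this note, and both the helper name `compression_ratio` and the 2,500-token page are hypothetical values chosen only to land in the quoted range.

```python
def compression_ratio(text_tokens, vision_tokens):
    """Text tokens represented per vision token (assumed definition)."""
    if vision_tokens <= 0:
        raise ValueError("vision_tokens must be positive")
    return text_tokens / vision_tokens


# Hypothetical page: 2,500 text tokens encoded as 256 vision tokens.
ratio = compression_ratio(2500, 256)
print(f"{ratio:.1f}x")  # 9.8x
```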

https://www.arxiv.org/pdf/2510.18234v1

deepseek-ai/DeepSeek-OCR · Hugging Face

https://huggingface.co/deepseek-ai/DeepSeek-OCR
