DeepSeek-OCR

Creator
Seonglae Cho
Created
2025 Oct 29 1:17
Edited
2026 Feb 4 16:58
Refs
Glyph VLM
The architecture pairs a dedicated vision encoder (DeepEncoder) with a language decoder (DeepSeek-3B MoE). DeepEncoder runs three stages in sequence: SAM-based local processing → 16x token compression → CLIP for global context. This converts a high-resolution image into very few vision tokens: a 1024×1024 document image, for example, yields 4096 patch tokens before compression and only about 256 vision tokens after it.
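The token arithmetic in that example can be checked with a short sketch. It assumes 16×16-pixel patches, which is not stated above but is the patch size that gives 4096 tokens for a 1024×1024 input; the function name and defaults are illustrative, not taken from the DeepSeek-OCR code.

```python
def vision_token_count(height, width, patch_size=16, compression=16):
    """Estimate token counts before and after compression.

    patch_size=16 is an assumption inferred from the 1024x1024 -> 4096 example;
    compression=16 is the 16x token compression described in the note.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    # One token per patch coming out of the local (SAM-based) stage.
    patch_tokens = (height // patch_size) * (width // patch_size)
    # Tokens left after 16x compression, which is what the global stage sees.
    vision_tokens = patch_tokens // compression
    return patch_tokens, vision_tokens


patch_tokens, vision_tokens = vision_token_count(1024, 1024)
print(patch_tokens, vision_tokens)  # 4096 256
```

The point of the ordering is that the expensive global-attention stage only ever sees the compressed sequence, so the cost of a high-resolution input is paid in the cheaper local stage.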

OCR

In the compression ratio vs. accuracy experiments, OCR restoration accuracy stays at around 97% even with approximately 9-10x compression. The paper uses OCR capability to demonstrate directly that this kind of compression is feasible. Glyph VLM, by comparison, took a more directly practical approach.
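As a rough illustration of what a compression ratio in that range means, the sketch below treats the ratio as text tokens divided by vision tokens. That definition and the page size of 2500 text tokens are assumptions made for the example, not figures from the note.

```python
def compression_ratio(text_tokens, vision_tokens):
    """Text tokens represented per vision token (assumed definition)."""
    if vision_tokens <= 0:
        raise ValueError("vision_tokens must be positive")
    return text_tokens / vision_tokens


# Hypothetical page: 2500 text tokens rendered into 256 vision tokens.
ratio = compression_ratio(2500, 256)
print(f"{ratio:.2f}x")  # 9.77x
```

Under these assumed numbers the page sits inside the 9-10x range where the note reports roughly 97% OCR restoration accuracy.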
 
 
 
www.arxiv.org
deepseek-ai/DeepSeek-OCR · Hugging Face
 
 
