A research implementation of DeepSeek-OCR based on the paper "DeepSeek-OCR: Contexts Optical Compression". This project explores OCR as a compression-decompression task using vision-language models.
DeepSeek-OCR compresses text-rich images into compact vision tokens, then decodes them back to text using a language model. Key features:
- High Compression: 64-400 vision tokens for documents with 600-5000+ text tokens
- Strong Performance: ~97% accuracy with <10× compression ratio
- Multi-Resolution: Supports Tiny (512²), Small (640²), Base (1024²), Large (1280²) modes
- Efficient: Lower token count than traditional VLMs while maintaining quality
Image → DeepEncoder (SAM-like → Compressor 16× → CLIP-like) → LM Decoder → Text
[Stage 1: Window Attn] [Conv 16× reduction] [Global Attn] [GPT-2]
- DeepEncoder (~380M params): Two-stage vision encoder with 16× compression
- Decoder (~570M params): Language model for text generation
See ARCHITECTURE.md for detailed explanation.
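The 16× compressor is what makes the token budget work: it collapses the SAM-stage patch grid by 4× per side before the expensive global-attention stage. A minimal sketch of that idea (illustrative only; the real module lives in models/deepencoder/compressor.py, and the layer shapes here are assumptions):

```python
import torch
import torch.nn as nn

class ConvCompressor(nn.Module):
    """Sketch of a 16x token compressor: two stride-2 convolutions
    halve each side of the patch grid twice (4x per side = 16x tokens)."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(embed_dim, embed_dim, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(embed_dim, embed_dim, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) tokens on a square patch grid
        b, n, c = x.shape
        side = int(n ** 0.5)
        x = x.transpose(1, 2).reshape(b, c, side, side)
        x = self.net(x)                      # (B, C, side/4, side/4)
        return x.flatten(2).transpose(1, 2)  # (B, N/16, C)

tokens = torch.randn(1, 4096, 256)           # e.g. a 64x64 patch grid
compressed = ConvCompressor()(tokens)
print(compressed.shape)  # torch.Size([1, 256, 256]): 4096 / 16 = 256 tokens
```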
- Python 3.8+
- PyTorch 2.0+
- CUDA 11.8+ (for GPU)
- 16GB+ RAM (32GB+ recommended)
# Clone repository
git clone https://github.com/yourusername/Deepseek-OCR.git
cd Deepseek-OCR
# Create virtual environment
conda create -n deepseek-ocr python=3.10 -y
conda activate deepseek-ocr
# Install dependencies
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt

# For Tesseract baseline
sudo apt-get install tesseract-ocr # Ubuntu
brew install tesseract # macOS
pip install pytesseract
# For PaddleOCR baseline
pip install paddleocr
# For demo UI
pip install gradio

# Run OCR on single image
python scripts/infer.py --image path/to/image.jpg --mode base
# Process directory of images
python scripts/infer.py --image_dir path/to/images/ --output results.txt
# Use different resolution mode
python scripts/infer.py --image document.png --mode large

Resolution Modes:
- tiny (64 tokens): Fast, for simple documents
- small (100 tokens): Balanced quality/speed
- base (256 tokens): Standard documents (default)
- large (400 tokens): Complex layouts, best quality
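The four modes map cleanly to a token budget: with 16×16 patches and the encoder's 16× compressor, vision tokens = (resolution/16)²/16. A small lookup table capturing the list above (a sketch; the actual values live in configs/modes.yaml):

```python
# Mode table from the list above; tokens = (resolution // 16)**2 // 16,
# i.e. a 16x16 patch grid followed by the 16x token compressor.
MODES = {
    "tiny":  {"resolution": 512,  "vision_tokens": 64},
    "small": {"resolution": 640,  "vision_tokens": 100},
    "base":  {"resolution": 1024, "vision_tokens": 256},
    "large": {"resolution": 1280, "vision_tokens": 400},
}

def tokens_for_mode(mode: str) -> int:
    cfg = MODES[mode]
    # Sanity-check the budget against the patch-grid formula
    assert cfg["vision_tokens"] == (cfg["resolution"] // 16) ** 2 // 16
    return cfg["vision_tokens"]

print(tokens_for_mode("base"))  # 256
```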
import torch
from transformers import AutoTokenizer
from models.ocr_model import DeepSeekOCR
from utils.image_io import load_image
# Load model
model = DeepSeekOCR()
model = model.eval()
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Load and process image
image = load_image("document.jpg", mode="base")
# Generate text
with torch.no_grad():
    generated_ids = model.generate(
        images=image,
        max_length=2048,
        temperature=0.7,
        top_p=0.9,
    )
text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(text)

# Launch Gradio interface
python demo/app.py --checkpoint path/to/model.pt --port 7860
# Access at http://localhost:7860

Organize your dataset:
dataset/
├── images/
│ ├── img001.jpg
│ ├── img002.png
│ └── ...
└── annotations.json # {"img001.jpg": "ground truth text", ...}
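A small helper to load that layout and flag missing files (a sketch against the structure above; load_dataset is not part of this repo's API):

```python
import json
from pathlib import Path
from typing import List, Tuple

def load_dataset(root: str) -> List[Tuple[Path, str]]:
    """Pair each entry in annotations.json (filename -> ground-truth text)
    with its image under images/, skipping entries whose file is missing."""
    base = Path(root)
    annotations = json.loads((base / "annotations.json").read_text(encoding="utf-8"))
    pairs = []
    for name, text in annotations.items():
        img = base / "images" / name
        if img.exists():
            pairs.append((img, text))
        else:
            print(f"warning: missing image {img}")
    return pairs
```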
# Evaluate DeepSeek-OCR
python evaluation/evaluate_ocr.py \
--dataset path/to/dataset \
--checkpoint model.pt \
--mode base \
--output deepseek_results.json
# Tesseract baseline
python evaluation/baselines/tesseract_eval.py \
--dataset path/to/dataset \
--output tesseract_results.json
# PaddleOCR baseline
python evaluation/baselines/paddle_eval.py \
--dataset path/to/dataset \
--output paddleocr_results.json

- CER (Character Error Rate): Character-level edit distance
- WER (Word Error Rate): Word-level edit distance
- Normalized Edit Distance: Levenshtein distance normalized by length
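All three metrics reduce to one edit-distance routine applied at the character or word level. A self-contained sketch (the repo's versions live in utils/metrics.py):

```python
def edit_distance(a, b) -> int:
    """Levenshtein distance (insertions, deletions, substitutions)
    between two sequences, via the classic DP over prefixes."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def cer(hyp: str, ref: str) -> float:
    """Character Error Rate: char-level edit distance / reference length."""
    return edit_distance(hyp, ref) / max(len(ref), 1)

def wer(hyp: str, ref: str) -> float:
    """Word Error Rate: the same distance over word sequences."""
    return edit_distance(hyp.split(), ref.split()) / max(len(ref.split()), 1)

print(edit_distance("kitten", "sitting"))  # 3
```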
On OmniDocBench:
| Model | Tokens/Page | CER | Notes |
|---|---|---|---|
| GOT-OCR2.0 | 256 | - | SOTA baseline |
| DeepSeek-OCR (Small) | 100 | - | Beats GOT-OCR2.0 |
| DeepSeek-OCR (Large) | 400 | - | Matches SOTA |
| MinerU2.0 | ~7000 | - | Token-heavy |
| DeepSeek-OCR (Gundam) | <800 | - | Beats MinerU |
Compression Study (Fox Benchmark):
| Compression Ratio | Accuracy | Token Budget |
|---|---|---|
| <10× | ~97% | 100-256 |
| 10-15× | 80-95% | 64-100 |
| ~20× | ~60% | 64 |
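The ratio in the table is simply the page's text-token count divided by the vision tokens spent, so you can estimate which accuracy regime a page falls into (a sketch; the regime boundary comes from the table above):

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """How many text tokens each vision token stands in for."""
    return text_tokens / vision_tokens

# A ~2500-text-token page decoded from base mode (256 vision tokens):
ratio = compression_ratio(2500, 256)
print(round(ratio, 1))  # 9.8 -> under 10x, the ~97% accuracy regime
```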
python scripts/train_encoder.py \
--data_path path/to/ocr_data \
--output_dir checkpoints/encoder \
--epochs 2 \
--batch_size 32 \
--lr 5e-5

python scripts/train_ocr.py \
--encoder_checkpoint checkpoints/encoder/best.pt \
--data_path path/to/ocr_data \
--output_dir checkpoints/full_model \
--epochs 5 \
--batch_size 16 \
--lr 3e-5

Training Data Mix (Paper):
- 70% OCR data (layout + non-layout)
- 20% General vision data
- 10% Text-only data
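One simple way to realize that mix during training is weighted sampling of the data source for each batch (an illustrative sketch; the repo's training scripts may implement the mix differently):

```python
import random

# 70/20/10 mix from the paper, expressed as sampling weights
DATA_MIX = {"ocr": 0.70, "general_vision": 0.20, "text_only": 0.10}

def sample_source(rng: random.Random) -> str:
    """Pick which data source the next training batch comes from."""
    names = list(DATA_MIX)
    return rng.choices(names, weights=[DATA_MIX[n] for n in names])[0]

rng = random.Random(0)
counts = {name: 0 for name in DATA_MIX}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
print(counts)  # roughly 7000 / 2000 / 1000
```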
Note: This implementation provides the architecture. Training requires large-scale data and compute similar to the paper (20 nodes × 8 A100-40G).
Deepseek-OCR/
├── configs/ # Model and mode configurations
│ ├── model.yaml
│ └── modes.yaml
├── models/ # Model implementations
│ ├── deepencoder/
│ │ ├── sam_stage.py # Window attention stage
│ │ ├── compressor.py # 16× token compressor
│ │ ├── clip_stage.py # Global attention stage
│ │ └── deepencoder.py # Full encoder
│ ├── decoders/
│ │ ├── lm_decoder.py # LM decoder
│ │ └── moe_layers.py # MoE (optional)
│ └── ocr_model.py # End-to-end model
├── utils/ # Utilities
│ ├── image_io.py # Image loading & preprocessing
│ ├── metrics.py # CER, WER metrics
│ ├── prompts.py # OCR prompts
│ └── ...
├── scripts/ # Training & inference
│ ├── infer.py
│ ├── train_encoder.py
│ └── train_ocr.py
├── evaluation/ # Evaluation scripts
│ ├── evaluate_ocr.py
│ └── baselines/
│ ├── tesseract_eval.py
│ └── paddle_eval.py
├── demo/ # Demo application
│ └── app.py
├── data/ # Data directory
└── docs/ # Documentation
└── ARCHITECTURE.md
- Contexts Optical Compression: First systematic study of compressing long text contexts via vision pathway
- DeepEncoder Design: Two-stage encoder (window + global attention) with 16× compression
- Compression-Accuracy Analysis: Quantifies feasibility of vision-based context compression
- Practical Performance: Matches/exceeds SOTA OCR with far fewer tokens
- Low-Activation Design: Window attention + compression before global attention
- Multi-Resolution Support: Single model handles 512² to 1280² (+ tiling for ultra-high-res)
- Token Efficiency: 100-400 tokens vs 1000s in traditional approaches
- Research Applications: Context compression, memory forgetting mechanisms in LLMs
@article{wei2025deepseek,
title={DeepSeek-OCR: Contexts Optical Compression},
author={Wei, Haoran and Sun, Yaofeng and Li, Yukun},
journal={arXiv preprint arXiv:2510.18234},
year={2025}
}

This is a research-friendly implementation that maintains architectural fidelity while being compute-accessible:
| Component | Paper | Our Implementation |
|---|---|---|
| SAM Stage | SAM-base (80M) | Swin-Tiny (timm) |
| CLIP Stage | CLIP-large (300M) | CLIP-base (HF) |
| Decoder | DeepSeek-3B-MoE | GPT-2 (HF) |
| Training Scale | 20×8 A100-40G | Single GPU friendly |
| Total Params | ~950M (~570M active) | ~300M |
- Compute-Friendly Backbones: Use smaller but architecturally similar models
- HuggingFace Integration: Leverage pretrained CLIP and LM weights
- Modular Design: Easy to swap components (e.g., upgrade to larger decoder)
- Fallbacks: Simple implementations when optional deps (timm, HF) unavailable
# Test model forward pass
python -c "from models.ocr_model import DeepSeekOCR; model = DeepSeekOCR(); print('Model loaded:', model.get_num_params())"
# Test image loading
python -c "from utils.image_io import load_image; img = load_image('test.jpg', 'base'); print('Shape:', img.shape)"

Replace components in models/deepencoder/:
# Example: Use different SAM-like backbone
from models.deepencoder.sam_stage import SAMLikeStage
stage1 = SAMLikeStage(
    model_name="swin_base_patch4_window7_224",  # Larger Swin
    embed_dim=256,
    pretrained=True,
)

- Paper: arXiv:2510.18234
- Official Repo: deepseek-ai/DeepSeek-OCR
- Model Weights: HuggingFace
- Related Work:
- DeepSeek-VL - Vision-Language foundation
- Vary - Vision vocabulary
- InternVL - Multi-resolution VLM
Contributions welcome! Areas:
- Training scripts with distributed support
- Gundam (tiling) mode implementation
- Layout-aware OCR prompts
- Multilingual evaluation
- MoE decoder integration
- More baseline comparisons
MIT License. See LICENSE for details.
- DeepSeek AI team for the original paper and research
- HuggingFace for transformers library
- timm library for vision backbones
- Open-source OCR community (Tesseract, PaddleOCR)
Note: This is a research implementation for educational purposes. For production use, consider the official DeepSeek-OCR model or train on large-scale data.
Star this repo if you find it useful :)
Follow me on X (Twitter)