Feature Request: Add PaddleOCR GPU support to Docker image #533

@137137137

Feature Request: Add PaddleOCR GPU support to unstructured-api Docker image

Summary

The official unstructured-api Docker image includes Tesseract but not PaddleOCR. Users who want to use PaddleOCR with GPU acceleration must build a custom image. It would be valuable to have an official GPU-enabled image with PaddleOCR pre-installed.

Current Behavior

  • The unstructured-api:latest image only includes Tesseract OCR
  • PaddleOCR must be manually installed via pip install paddlepaddle unstructured-paddleocr
  • There is no GPU-enabled variant of the image
  • The OCR_AGENT environment variable is ignored (see related issue: OCR_AGENT_BUG_ISSUE.md)

Proposed Solution

Option 1: Provide GPU-enabled image tags

Publish additional Docker image variants:

unstructured-api:latest-gpu-cu118  # CUDA 11.8
unstructured-api:latest-gpu-cu126  # CUDA 12.6

These images would include:

  • paddlepaddle-gpu from the appropriate CUDA index
  • unstructured-paddleocr
  • NVIDIA CUDA runtime

Option 2: Add build args to existing Dockerfile

Add build arguments to allow users to build GPU-enabled images:

ARG USE_GPU=false
ARG CUDA_VERSION=cu118

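# -i replaces the default package index for the whole pip command, so
# unstructured-paddleocr is installed separately from the default index.
RUN if [ "$USE_GPU" = "true" ]; then \
        pip install --no-cache-dir paddlepaddle-gpu \
            -i https://www.paddlepaddle.org.cn/packages/stable/${CUDA_VERSION}/ && \
        pip install --no-cache-dir unstructured-paddleocr; \
    fi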
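
For build scripts that generate these commands, here is a minimal Python sketch of the same logic. The function name `paddle_install_commands` and the constant `PADDLE_INDEX_TEMPLATE` are hypothetical; the index URL pattern is the one used in the Dockerfile snippet above. The point it illustrates is that the Paddle index is passed only to the paddlepaddle-gpu install, not to unstructured-paddleocr.

```python
# Index URL pattern taken from the Dockerfile snippet in this issue.
PADDLE_INDEX_TEMPLATE = "https://www.paddlepaddle.org.cn/packages/stable/{cuda_tag}/"


def paddle_install_commands(use_gpu, cuda_tag="cu118"):
    """Return the pip commands (as argument lists) for a CPU or GPU build.

    Hypothetical helper: it only builds the commands, it does not run them.
    """
    base = ["pip", "install", "--no-cache-dir"]
    if use_gpu:
        index_url = PADDLE_INDEX_TEMPLATE.format(cuda_tag=cuda_tag)
        # The custom index is scoped to the paddlepaddle-gpu install only.
        paddle_cmd = base + ["paddlepaddle-gpu", "-i", index_url]
    else:
        paddle_cmd = base + ["paddlepaddle"]
    # unstructured-paddleocr always comes from the default index.
    ocr_cmd = base + ["unstructured-paddleocr"]
    return [paddle_cmd, ocr_cmd]


if __name__ == "__main__":
    for cmd in paddle_install_commands(use_gpu=True, cuda_tag="cu126"):
        print(" ".join(cmd))
```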

Implementation Details

The unstructured library already supports GPU acceleration for PaddleOCR. In unstructured/partition/utils/ocr_models/paddle_ocr.py:

gpu_available = paddle.device.cuda.device_count() > 0
if gpu_available:
    logger.info(f"Loading paddle with GPU on language={language}...")

paddle_ocr = PaddleOCR(
    use_angle_cls=True,
    use_gpu=gpu_available,  # Auto-detects GPU
    lang=language,
    enable_mkldnn=True,
    show_log=False,
)

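This means the library uses the GPU automatically whenever paddle detects a CUDA device. On the Python side, the only change required is installing paddlepaddle-gpu instead of paddlepaddle; the container still needs the CUDA runtime and access to the host GPU.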
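
To confirm from inside a container that the GPU build is actually in effect, a small check along these lines may help. This is a sketch, not part of unstructured: the function name `paddle_gpu_status` is hypothetical, and it relies only on `paddle.is_compiled_with_cuda()` and the same `paddle.device.cuda.device_count()` call quoted above.

```python
def paddle_gpu_status():
    """Report whether paddle is installed, CUDA-enabled, and sees a GPU."""
    try:
        import paddle
    except ImportError:
        # paddle is not installed in this environment at all.
        return {"installed": False, "cuda_build": False, "gpu_count": 0}

    # False for the CPU-only paddlepaddle wheel.
    cuda_build = bool(paddle.is_compiled_with_cuda())
    # Same detection call that paddle_ocr.py uses to set use_gpu.
    gpu_count = int(paddle.device.cuda.device_count()) if cuda_build else 0
    return {"installed": True, "cuda_build": cuda_build, "gpu_count": gpu_count}


if __name__ == "__main__":
    print(paddle_gpu_status())
```

If `cuda_build` is true but `gpu_count` is 0, the wheel is correct and the container most likely has no access to the host GPU.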

Workaround

Users can extend the official image:

FROM downloads.unstructured.io/unstructured-io/unstructured-api:latest

USER root

ARG USE_GPU=false
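# -i replaces the default package index for the whole pip command, so
# unstructured-paddleocr is installed in its own step from the default index.
RUN if [ "$USE_GPU" = "true" ]; then \
        pip install --no-cache-dir paddlepaddle-gpu \
            -i https://www.paddlepaddle.org.cn/packages/stable/cu118/; \
    else \
        pip install --no-cache-dir paddlepaddle; \
    fi && \
    pip install --no-cache-dir unstructured-paddleocr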

USER notebook-user

ENV OCR_AGENT=unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle

Note: This also requires patching general.py to pass ocr_agent to partition() - see OCR_AGENT_BUG_ISSUE.md.
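
As a sanity check for the ENV line above, the following sketch verifies that the dotted path in OCR_AGENT points at an importable class. The helper names `resolve_dotted_class` and `check_ocr_agent` are hypothetical and use only the standard library. It confirms the value is valid inside the container; it does not show that the API actually passes the agent to partition().

```python
import importlib
import os


def resolve_dotted_class(dotted_path):
    """Import 'package.module.ClassName' and return the named attribute."""
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected a dotted path, got {dotted_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


def check_ocr_agent(env=None):
    """Return the class named by OCR_AGENT, or None if the variable is unset."""
    env = os.environ if env is None else env
    dotted_path = env.get("OCR_AGENT")
    if dotted_path is None:
        return None
    return resolve_dotted_class(dotted_path)


if __name__ == "__main__":
    # Raises ImportError or AttributeError if the configured path is wrong.
    print(check_ocr_agent())
```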

Benefits

  1. Performance: PaddleOCR on a GPU can be substantially faster than CPU-bound Tesseract for batch processing
  2. Accuracy: PaddleOCR (especially PP-OCRv4) provides better accuracy on many document types
  3. Ease of use: Official GPU images eliminate the need for custom Dockerfiles

Environment

  • unstructured-api: latest
  • unstructured: 0.18.18+
  • PaddlePaddle: 3.2.2
  • CUDA: 11.8 / 12.6

Related Issues

  • OCR_AGENT environment variable is ignored (see OCR_AGENT_BUG_ISSUE.md)
