Feature Request: Add PaddleOCR GPU support to unstructured-api Docker image
Summary
The official unstructured-api Docker image includes Tesseract but not PaddleOCR. Users who want to use PaddleOCR with GPU acceleration must build a custom image. It would be valuable to have an official GPU-enabled image with PaddleOCR pre-installed.
Current Behavior
- The unstructured-api:latest image only includes Tesseract OCR
- PaddleOCR must be manually installed via pip install paddlepaddle unstructured-paddleocr
- There is no GPU-enabled variant of the image
- The OCR_AGENT environment variable is ignored (see related issue: OCR_AGENT_BUG_ISSUE.md)
Proposed Solution
Option 1: Provide GPU-enabled image tags
Publish additional Docker image variants:
unstructured-api:latest-gpu-cu118 # CUDA 11.8
unstructured-api:latest-gpu-cu126 # CUDA 12.6
These images would include:
- paddlepaddle-gpu from the appropriate CUDA index
- unstructured-paddleocr
- NVIDIA CUDA runtime
Option 2: Add build args to existing Dockerfile
Add build arguments to allow users to build GPU-enabled images:
ARG USE_GPU=false
ARG CUDA_VERSION=cu118
RUN if [ "$USE_GPU" = "true" ]; then \
        pip install --no-cache-dir \
            paddlepaddle-gpu -i https://www.paddlepaddle.org.cn/packages/stable/${CUDA_VERSION}/ \
            unstructured-paddleocr; \
    fi
Implementation Details
The unstructured library already supports GPU acceleration for PaddleOCR. In unstructured/partition/utils/ocr_models/paddle_ocr.py:
gpu_available = paddle.device.cuda.device_count() > 0
if gpu_available:
    logger.info(f"Loading paddle with GPU on language={language}...")
paddle_ocr = PaddleOCR(
    use_angle_cls=True,
    use_gpu=gpu_available,  # Auto-detects GPU
    lang=language,
    enable_mkldnn=True,
    show_log=False,
)
This means the library automatically uses GPU when available - the only requirement is installing paddlepaddle-gpu instead of paddlepaddle.
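To confirm which path the library will take in a given container, the same device check can be run on its own. The following is a minimal sketch, not part of unstructured: the helper name paddle_gpu_count is hypothetical, and it returns None when Paddle is not installed at all.

```python
import importlib
from typing import Optional


def paddle_gpu_count() -> Optional[int]:
    """Return the number of CUDA devices Paddle reports, or None if Paddle is missing.

    Hypothetical helper for illustration; it reuses the
    paddle.device.cuda.device_count() call quoted from paddle_ocr.py.
    """
    try:
        paddle = importlib.import_module("paddle")
    except ImportError:
        # Neither paddlepaddle nor paddlepaddle-gpu is installed.
        return None
    return paddle.device.cuda.device_count()


if __name__ == "__main__":
    count = paddle_gpu_count()
    if count is None:
        print("Paddle is not installed")
    elif count > 0:
        print(f"Paddle sees {count} CUDA device(s); use_gpu would be True")
    else:
        print("Paddle sees no CUDA devices; use_gpu would be False")
```

A count of 0 with paddlepaddle-gpu installed most likely means the container was started without access to the GPU, which is worth ruling out before suspecting the library.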
Workaround
Users can extend the official image:
FROM downloads.unstructured.io/unstructured-io/unstructured-api:latest
USER root
ARG USE_GPU=false
RUN if [ "$USE_GPU" = "true" ]; then \
        pip install --no-cache-dir \
            paddlepaddle-gpu -i https://www.paddlepaddle.org.cn/packages/stable/cu118/ \
            unstructured-paddleocr; \
    else \
        pip install --no-cache-dir \
            paddlepaddle \
            unstructured-paddleocr; \
    fi
USER notebook-user
ENV OCR_AGENT=unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle
Note: This also requires patching general.py to pass ocr_agent to partition() - see OCR_AGENT_BUG_ISSUE.md.
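After building the extended image, it can help to check inside the container that the class named by OCR_AGENT is actually importable. The sketch below is one way to do that, assuming only the dotted path shown in the ENV line. The function name resolve_ocr_agent is hypothetical, and the check does not address the separate problem of the API ignoring the variable.

```python
import importlib
import os
from typing import Any, Optional

# Dotted path taken from the ENV line in the workaround Dockerfile.
DEFAULT_AGENT = "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle"


def resolve_ocr_agent(dotted_path: Optional[str] = None) -> Any:
    """Import the object named by a dotted path and return it, or None on failure.

    Hypothetical smoke-test helper: falls back to the OCR_AGENT environment
    variable, then to DEFAULT_AGENT, when no path is passed.
    """
    dotted_path = dotted_path or os.environ.get("OCR_AGENT", DEFAULT_AGENT)
    module_name, _, attr_name = dotted_path.rpartition(".")
    if not module_name:
        # No dot in the path, so there is no module to import.
        return None
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The module, or one of its dependencies, is not installed.
        return None
    return getattr(module, attr_name, None)


if __name__ == "__main__":
    agent = resolve_ocr_agent()
    print("OCR agent importable" if agent is not None else "OCR agent NOT importable")
```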
Benefits
- Performance: PaddleOCR with GPU is significantly faster than Tesseract for batch processing
- Accuracy: PaddleOCR (especially PP-OCRv4) provides better accuracy on many document types
- Ease of use: Official GPU images eliminate the need for custom Dockerfiles
Environment
- unstructured-api: latest
- unstructured: 0.18.18+
- PaddlePaddle: 3.2.2
- CUDA: 11.8 / 12.6
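For anyone reproducing this setup, the installed package versions can be collected from inside the container with the standard library. This is a small sketch using the distribution names mentioned in this issue. The function name installed_versions is hypothetical, and packages that are not installed are reported as None.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, Optional

# Distribution names as they appear in the pip commands in this issue.
PACKAGES = (
    "unstructured",
    "unstructured-paddleocr",
    "paddlepaddle",
    "paddlepaddle-gpu",
)


def installed_versions(names: Iterable[str] = PACKAGES) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    report: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```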
Related Issues
- OCR_AGENT environment variable is ignored (see OCR_AGENT_BUG_ISSUE.md)