OCR_AGENT environment variable is ignored - API always uses Tesseract #532

@137137137

Description

The OCR_AGENT environment variable is not respected by the API. Even when OCR_AGENT is set to use PaddleOCR (or any other OCR agent), the API always uses Tesseract because the ocr_agent parameter is never passed to the partition() function.

Steps to Reproduce

  1. Build a container with PaddleOCR installed:
FROM downloads.unstructured.io/unstructured-io/unstructured-api:latest
RUN pip install "paddlepaddle>=3.0.0b1" "unstructured.paddleocr==2.10.0"
ENV OCR_AGENT=unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle
  2. Start the container and send a PDF for processing

  3. Observe that Tesseract is used instead of PaddleOCR:

$ docker exec unstructured ps aux | grep tesseract
tesseract /tmp/tess_xxx_input.PNG /tmp/tess_xxx -l eng -c tessedit_create_hocr=1

Expected Behavior

When OCR_AGENT environment variable is set, the API should use that OCR agent for processing.

Actual Behavior

The API ignores the OCR_AGENT environment variable and always uses Tesseract.

Root Cause

In prepline_general/api/general.py, the partition_kwargs dictionary does not include ocr_agent. The partition() function is called without this parameter, so it defaults to OCR_AGENT_TESSERACT.

The env_config.OCR_AGENT property correctly reads the environment variable, but it's never used when calling partition.

Current code (around lines 580-600 in general.py):

partition_kwargs = {
    "strategy": strategy,
    "xml_keep_tags": xml_keep_tags,
    "languages": languages,
    # ... other params ...
    # NOTE: ocr_agent is missing!
}
elements = partition(**partition_kwargs)
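
To make the failure mode concrete, here is a minimal sketch that does not import unstructured at all. `fake_partition`, `TESSERACT`, and `PADDLE` are hypothetical stand-ins, and the default-to-Tesseract behavior is modeled on what this issue describes rather than on the library source. It only illustrates that a keyword default wins whenever the caller omits the key, regardless of what the environment says.

```python
import os

# Hypothetical stand-ins; the real agent names live in unstructured and are not imported here.
TESSERACT = "tesseract-agent"
PADDLE = "paddle-agent"


def fake_partition(strategy, ocr_agent=TESSERACT, **kwargs):
    """Stand-in for partition(): report which OCR agent would be used."""
    return ocr_agent


# Equivalent of the ENV line in the Dockerfile from the reproduction steps.
os.environ["OCR_AGENT"] = PADDLE

# Same shape as the dictionary above: there is no "ocr_agent" key.
partition_kwargs = {"strategy": "hi_res", "languages": ["eng"]}

agent_used = fake_partition(**partition_kwargs)
print(agent_used)  # the keyword default, not the value from the environment
```

Nothing in the call path reads `os.environ`, so the variable being set correctly makes no difference to the result.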

Proposed Fix

Add ocr_agent to partition_kwargs:

from unstructured.partition.utils.config import env_config

partition_kwargs = {
    "strategy": strategy,
    "ocr_agent": env_config.OCR_AGENT,  # Add this line
    "xml_keep_tags": xml_keep_tags,
    # ... rest of params ...
}
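
A regression check for this fix could look roughly like the sketch below. It is written against stand-ins, not the real API code: `build_partition_kwargs` and `read_ocr_agent_from_env` are hypothetical helpers, and the assumption that `env_config.OCR_AGENT` reads the variable at access time comes from the description above. A real test would patch `partition` inside general.py instead.

```python
import os
from unittest import mock


def read_ocr_agent_from_env():
    """Stand-in for env_config.OCR_AGENT (assumed to read OCR_AGENT on access)."""
    return os.environ.get("OCR_AGENT", "tesseract-agent")


def build_partition_kwargs(strategy, read_ocr_agent):
    """Hypothetical helper: build the kwargs the way the proposed fix does."""
    return {"strategy": strategy, "ocr_agent": read_ocr_agent()}


def check_ocr_agent_is_forwarded():
    """Assert that the agent named in the environment reaches partition()."""
    fake_partition = mock.Mock(return_value=[])
    # patch.dict restores the original environment when the block exits.
    with mock.patch.dict(os.environ, {"OCR_AGENT": "paddle-agent"}):
        kwargs = build_partition_kwargs("hi_res", read_ocr_agent_from_env)
        fake_partition(**kwargs)
    fake_partition.assert_called_once_with(strategy="hi_res", ocr_agent="paddle-agent")
    return kwargs


forwarded = check_ocr_agent_is_forwarded()
```

Asserting on the call arguments rather than on OCR output keeps the check fast and independent of whether PaddleOCR is installed.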

Workaround

Patch the general.py file in the container:

RUN sed -i \
    -e '1a from unstructured.partition.utils.config import env_config' \
    -e 's/"strategy": strategy,/"strategy": strategy,\n            "ocr_agent": env_config.OCR_AGENT,/' \
    /home/notebook-user/prepline_general/api/general.py
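
If running the sed command twice is a concern (it would add the import and the entry again), the same edit can be sketched in Python with a guard. This is an illustration, not a tested patch for the image: `patch_source` is a hypothetical helper, it matches only lines that consist of the anchor text, and it copies the anchor line's indentation instead of hard-coding it. Like the sed version, it inserts the import after the first line, which assumes that position is a valid place for an import.

```python
IMPORT_LINE = "from unstructured.partition.utils.config import env_config"
ANCHOR = '"strategy": strategy,'
NEW_ENTRY = '"ocr_agent": env_config.OCR_AGENT,'


def patch_source(source: str) -> str:
    """Return source with the import and the ocr_agent entry added, at most once."""
    if NEW_ENTRY in source:
        return source  # already patched, leave untouched

    patched = []
    found_anchor = False
    for line in source.splitlines():
        patched.append(line)
        if line.strip() == ANCHOR:
            # Reuse the anchor line's leading whitespace for the new entry.
            indent = line[: len(line) - len(line.lstrip())]
            patched.append(indent + NEW_ENTRY)
            found_anchor = True

    if not found_anchor:
        raise ValueError("anchor line not found; the file layout differs from the issue")

    if IMPORT_LINE not in source:
        patched.insert(1, IMPORT_LINE)  # mirrors sed's '1a': after the first line

    return "\n".join(patched) + "\n"
```

In a Dockerfile this would be applied by reading general.py, passing its text through `patch_source`, and writing the result back to the same path.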

Environment

  • unstructured-api: latest (as of Dec 2024)
  • unstructured: 0.18.18
  • Docker image: downloads.unstructured.io/unstructured-io/unstructured-api:latest

Additional Context

This issue affects anyone trying to use an alternative OCR agent (PaddleOCR, Google Vision OCR) with the self-hosted API. The documentation at https://docs.unstructured.io/open-source/core-functionality/set-the-ocr-agent suggests setting the OCR_AGENT environment variable, but this doesn't work with the API due to this bug.
