OCR_AGENT environment variable is ignored - API always uses Tesseract
Description
The OCR_AGENT environment variable is not respected by the API. Even when OCR_AGENT is set to use PaddleOCR (or any other OCR agent), the API always uses Tesseract because the ocr_agent parameter is never passed to the partition() function.
Steps to Reproduce
- Build a container with PaddleOCR installed:
FROM downloads.unstructured.io/unstructured-io/unstructured-api:latest
RUN pip install "paddlepaddle>=3.0.0b1" "unstructured.paddleocr==2.10.0"
ENV OCR_AGENT=unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle
- Start the container and send a PDF for processing.
- Observe that Tesseract is used instead of PaddleOCR:
$ docker exec unstructured ps aux | grep tesseract
tesseract /tmp/tess_xxx_input.PNG /tmp/tess_xxx -l eng -c tessedit_create_hocr=1
Expected Behavior
When OCR_AGENT environment variable is set, the API should use that OCR agent for processing.
Actual Behavior
The API ignores the OCR_AGENT environment variable and always uses Tesseract.
Root Cause
In prepline_general/api/general.py, the partition_kwargs dictionary does not include ocr_agent. The partition() function is called without this parameter, so it defaults to OCR_AGENT_TESSERACT.
The env_config.OCR_AGENT property correctly reads the environment variable, but it's never used when calling partition.
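The mechanism can be reproduced in isolation. The sketch below is not the real unstructured code: `partition`, `read_ocr_agent_from_env`, and the string values `"tesseract"`, `"paddle"`, and `"hi_res"` are hypothetical stand-ins, chosen only to show that a keyword left out of the kwargs dictionary falls back to the function default no matter what the environment says.

```python
import os

# Hypothetical stand-ins for illustration; not the real unstructured API.
OCR_AGENT_TESSERACT = "tesseract"  # placeholder for the library default


def read_ocr_agent_from_env() -> str:
    """Stand-in for env_config.OCR_AGENT: read the variable, else use the default."""
    return os.environ.get("OCR_AGENT", OCR_AGENT_TESSERACT)


def partition(strategy: str, ocr_agent: str = OCR_AGENT_TESSERACT) -> str:
    """Stand-in for partition(): return the OCR agent that would be used."""
    return ocr_agent


os.environ["OCR_AGENT"] = "paddle"  # placeholder value for the demo

# Buggy call path: ocr_agent is absent from the kwargs, so the default applies.
buggy_kwargs = {"strategy": "hi_res"}
used_without_fix = partition(**buggy_kwargs)

# Fixed call path: the value read from the environment is passed explicitly.
fixed_kwargs = {"strategy": "hi_res", "ocr_agent": read_ocr_agent_from_env()}
used_with_fix = partition(**fixed_kwargs)

print(used_without_fix, used_with_fix)
```

The environment variable is read correctly in both cases; only the second call actually hands it to the function.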
Current code (around line 580-600 in general.py):
partition_kwargs = {
    "strategy": strategy,
    "xml_keep_tags": xml_keep_tags,
    "languages": languages,
    # ... other params ...
    # NOTE: ocr_agent is missing!
}
elements = partition(**partition_kwargs)
Proposed Fix
Add ocr_agent to partition_kwargs:
from unstructured.partition.utils.config import env_config

partition_kwargs = {
    "strategy": strategy,
    "ocr_agent": env_config.OCR_AGENT,  # Add this line
    "xml_keep_tags": xml_keep_tags,
    # ... rest of params ...
}
Workaround
Patch the general.py file in the container:
RUN sed -i \
    -e '1a from unstructured.partition.utils.config import env_config' \
    -e 's/"strategy": strategy,/"strategy": strategy,\n "ocr_agent": env_config.OCR_AGENT,/' \
    /home/notebook-user/prepline_general/api/general.py
Environment
- unstructured-api: latest (as of Dec 2024)
- unstructured: 0.18.18
- Docker image: downloads.unstructured.io/unstructured-io/unstructured-api:latest
Additional Context
This issue affects anyone trying to use an alternative OCR agent (PaddleOCR, Google Vision OCR) with the self-hosted API. The documentation at https://docs.unstructured.io/open-source/core-functionality/set-the-ocr-agent suggests setting the OCR_AGENT environment variable, but this doesn't work with the API due to this bug.
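For anyone applying the sed workaround repeatedly (for example on image rebuilds), the same patch can be sketched in Python so that it is a no-op when already applied and fails loudly if the anchor line is gone. This is only a sketch: `patch_source` and `patch_file` are hypothetical helper names, the indentation of the added line is a guess, and it assumes general.py still contains the exact text `"strategy": strategy,`.

```python
from pathlib import Path

IMPORT_LINE = "from unstructured.partition.utils.config import env_config"
ANCHOR = '"strategy": strategy,'
ADDED = '"ocr_agent": env_config.OCR_AGENT,'


def patch_source(source: str) -> str:
    """Return source with the import and the ocr_agent entry added.

    Returns the input unchanged if the entry is already present.
    """
    if ADDED in source:
        return source
    if ANCHOR not in source:
        raise ValueError("anchor line not found; general.py may have changed")
    lines = source.splitlines(keepends=True)
    if not lines[0].endswith("\n"):
        lines[0] += "\n"
    # Mirror sed's '1a': insert the import after the first line of the file.
    lines.insert(1, IMPORT_LINE + "\n")
    patched = "".join(lines)
    # Like the sed expression, add the new entry after every anchor match.
    # The 8-space indent is an assumption; Python ignores it inside a dict literal.
    return patched.replace(ANCHOR, ANCHOR + "\n        " + ADDED)


def patch_file(path: Path) -> None:
    """Rewrite the file at path in place with the patched source."""
    path.write_text(patch_source(path.read_text()))
```

Inside the container this would be called as `patch_file(Path("/home/notebook-user/prepline_general/api/general.py"))`, using the path from the workaround above.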