OSError: 页面文件太小 (Page file too small) during PDF extraction with marker/surya #14

@YLChen-007

Describe the bug
Running the script on a PDF fails with OSError: 页面文件太小,无法完成操作。 (os error 1455). In English the message reads: "The paging file is too small for this operation to complete."

The crash appears to happen while the marker-pdf / surya models are being initialized for content extraction. After the first failure the script retried with device='cpu', and the retry failed with the same error.
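Since the CPU retry hits the same error, it may be worth having the fallback recognize error 1455 and stop early with a clearer message. The following is a minimal sketch, not the project's actual code: `is_commit_limit_error` and `init_with_fallback` are hypothetical helpers, and matching on the "os error 1455" text is an assumption based on the message in the log below.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")

# Windows system error code: "The paging file is too small for this operation to complete."
ERROR_COMMITMENT_LIMIT = 1455


def is_commit_limit_error(exc: BaseException) -> bool:
    """Return True if exc looks like Windows error 1455 (hypothetical helper)."""
    # OSError raised by Python itself on Windows carries a numeric winerror.
    if getattr(exc, "winerror", None) == ERROR_COMMITMENT_LIMIT:
        return True
    # Errors raised by native extensions may only carry the code in the message text.
    return f"os error {ERROR_COMMITMENT_LIMIT}" in str(exc)


def init_with_fallback(
    factory: Callable[[Optional[str]], T],
    devices: Sequence[Optional[str]] = (None, "cpu"),
) -> T:
    """Try factory(device) for each device in order, returning the first success.

    A commit-limit error is treated as system-wide memory pressure, so the
    remaining devices are skipped instead of retried.
    """
    if not devices:
        raise ValueError("devices must not be empty")
    last_exc: Optional[OSError] = None
    for device in devices:
        try:
            return factory(device)
        except OSError as exc:
            if is_commit_limit_error(exc):
                raise RuntimeError(
                    "Windows commit limit reached (error 1455); "
                    "retrying on another device is unlikely to help. "
                    "Free memory or enlarge the page file."
                ) from exc
            last_exc = exc
    assert last_exc is not None
    raise last_exc
```

With the call shown in the traceback, the factory could be something like `lambda d: create_model_dict(device=d)`.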

To Reproduce
Run the following command on Windows:

python main.py "E:\temp\agentfuzz-security25.pdf" --language en --model gpt-4o --theme Madrid --output-dir output --verbose

Error Log

2025-12-15 11:53:04,688 - INFO - Starting marker-pdf content extraction: E:\temp\agentfuzz-security25.pdf
2025-12-15 11:53:04,688 - INFO - Initializing models... (device preference: None)
2025-12-15 11:53:06,062 - WARNING - Model initialization failed: 页面文件太小,无法完成操作。 (os error 1455). Retrying with device='cpu'...
2025-12-15 11:53:07,387 - ERROR - Content extraction failed: 页面文件太小,无法完成操作。 (os error 1455)
Traceback (most recent call last):
  File "D:\sourcecode\Auto-Slides\modules\lightweight_extractor.py", line 76, in extract_content
    converter = PdfConverter(artifact_dict=create_model_dict(device=device))
  ...
  File "D:\sourcecode\Auto-Slides\venv\Lib\site-packages\transformers\modeling_utils.py", line 4450, in from_pretrained
    with safe_open(checkpoint_files[0], framework="pt") as f:
OSError: 页面文件太小,无法完成操作。 (os error 1455)

Environment

  • OS: Windows
  • Project Path: D:\sourcecode\Auto-Slides
  • Task: PDF Content Extraction

Additional context
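Error 1455 (ERROR_COMMITMENT_LIMIT) usually means that a memory request would exceed the Windows commit limit, which is physical RAM plus the page file. The traceback ends in safe_open while transformers opens the first checkpoint file, so loading the surya recognition model appears to need more committable memory than the system has left. That would explain why the limit is hit even after falling back to CPU.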
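To check whether the commit limit is really the bottleneck, the available commit can be read before the models are loaded. This is a rough sketch that calls the Win32 `GlobalMemoryStatusEx` API through `ctypes`; `query_commit` and `has_commit_headroom` are hypothetical names, and the required size and the 20% safety margin are placeholders, since the issue does not state how much memory the models need.

```python
import ctypes
import sys
from typing import Optional, Tuple


class MEMORYSTATUSEX(ctypes.Structure):
    """Mirror of the Win32 MEMORYSTATUSEX structure."""
    _fields_ = [
        ("dwLength", ctypes.c_uint32),
        ("dwMemoryLoad", ctypes.c_uint32),
        ("ullTotalPhys", ctypes.c_uint64),
        ("ullAvailPhys", ctypes.c_uint64),
        ("ullTotalPageFile", ctypes.c_uint64),   # current commit limit
        ("ullAvailPageFile", ctypes.c_uint64),   # commit still available to this process
        ("ullTotalVirtual", ctypes.c_uint64),
        ("ullAvailVirtual", ctypes.c_uint64),
        ("ullAvailExtendedVirtual", ctypes.c_uint64),
    ]


def query_commit() -> Optional[Tuple[int, int]]:
    """Return (commit_limit, commit_available) in bytes, or None off Windows."""
    if sys.platform != "win32":
        return None
    status = MEMORYSTATUSEX()
    status.dwLength = ctypes.sizeof(MEMORYSTATUSEX)
    if not ctypes.windll.kernel32.GlobalMemoryStatusEx(ctypes.byref(status)):
        raise ctypes.WinError()
    return status.ullTotalPageFile, status.ullAvailPageFile


def has_commit_headroom(available: int, required: int, margin: float = 0.2) -> bool:
    """True if `available` covers `required` plus a safety margin (assumed 20%)."""
    return available >= required * (1 + margin)


if __name__ == "__main__":
    commit = query_commit()
    if commit is None:
        print("Not running on Windows; commit limit does not apply.")
    else:
        limit, available = commit
        gib = 1024 ** 3
        print(f"commit limit: {limit / gib:.1f} GiB, available: {available / gib:.1f} GiB")
        # 4 GiB is a placeholder; the real model footprint is not given in this issue.
        print("enough headroom:", has_commit_headroom(available, 4 * gib))
```

If the available commit is low, closing other applications or increasing the page file size in the Windows virtual memory settings are the usual ways to raise it.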
