Enterprise-grade document parsing service with asynchronous queue processing based on Celery, featuring a fully decoupled API/Worker architecture.
- 🚀 Asynchronous Processing: Distributed task queue based on Celery
- 📄 Multi-format Support: PDF, Office, images, and various document formats
- 🔄 High Availability: Supports task retry and fault recovery
- 📊 Real-time Monitoring: Task status tracking and queue statistics
- 🎯 Priority Queue: Supports task priority scheduling
- 🔧 Easy to Extend: Modular design, easy to add new parsing engines
- Docker and Docker Compose
- (Optional) NVIDIA GPU for GPU worker
Five steps to start:

1. Copy configuration files:

   ```bash
   # Project root
   cp .env.example .env
   cd docker && cp .env.example .env
   ```

2. Configure service selection (in `docker/.env`):

   ```bash
   cd docker
   # Edit .env file, set COMPOSE_PROFILES (choose one)
   # Option 1: GPU Worker + internal Redis (default, requires NVIDIA GPU)
   COMPOSE_PROFILES=redis,mineru-gpu
   # Option 2: CPU Worker + internal Redis (recommended for development)
   # COMPOSE_PROFILES=redis,mineru-cpu
   ```
   💡 Notes:

   - Default: `COMPOSE_PROFILES=redis,mineru-gpu` (GPU Worker)
   - `COMPOSE_PROFILES` controls which optional services start (Redis and Worker)
   - API and Cleanup services start automatically (no profile, required services)
3. Build images:

   ```bash
   cd docker
   # Simplest: run directly (automatically selects CPU or GPU Worker based on COMPOSE_PROFILES)
   sh build.sh
   # Or manually specify (build.sh supports parameters to build only needed services)
   # GPU Worker:
   sh build.sh --api --worker-gpu
   # CPU Worker:
   sh build.sh --api --worker-cpu
   ```
4. Start services:

   ```bash
   cd docker
   # Simplest: start directly (automatically starts configured services based on COMPOSE_PROFILES)
   docker compose up -d
   # Or manually specify (equivalent)
   # GPU Worker:
   docker compose --profile redis --profile mineru-gpu up -d
   # CPU Worker:
   docker compose --profile redis --profile mineru-cpu up -d
   ```
5. Verify services:

   ```bash
   curl http://localhost:8000/api/v1/health
   ```
That's it! The API is now running at http://localhost:8000.
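The same health check can be scripted from Python. A minimal stdlib-only sketch (the response body's JSON shape is not documented here, so the function simply returns whatever the endpoint sends):

```python
import json
import urllib.request

def check_health(base_url="http://localhost:8000", timeout=5):
    """GET /api/v1/health and return the decoded JSON payload."""
    with urllib.request.urlopen(f"{base_url}/api/v1/health", timeout=timeout) as resp:
        return json.load(resp)
```

Useful as a readiness probe in scripts that need to wait for the stack to come up before submitting work.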
💡 Tips:

- After configuring `COMPOSE_PROFILES`, both `sh build.sh` and `docker compose up -d` will automatically recognize it
- `sh build.sh` without parameters automatically selects the CPU or GPU Worker based on `COMPOSE_PROFILES`
- You can also use parameters to specify explicitly: `sh build.sh --api --worker-gpu` or `sh build.sh --api --worker-cpu`
- See docker/README.md for more configuration options
MinerU-API provides two API interfaces to suit different use cases:
The /file_parse endpoint is compatible with the official MinerU API format. It submits tasks to the worker and waits for completion, returning results directly in the response.
Reference: MinerU Official API
```bash
curl -X POST "http://localhost:8000/file_parse" \
  -F "files=@document.pdf" \
  -F "backend=pipeline" \
  -F "lang_list=ch" \
  -F "parse_method=auto" \
  -F "return_md=true"
```

Use cases: Simple integration, immediate results needed, compatible with existing MinerU clients.
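The same request can be made from Python without third-party dependencies. In this sketch, the `build_multipart` helper is hypothetical (not part of this project); the form field names mirror the curl example above:

```python
import uuid

def build_multipart(fields, file_field, filename, file_bytes):
    """Encode text fields plus one file as a multipart/form-data body (stdlib only)."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{name}"\r\n\r\n{value}\r\n'.encode())
    parts.append(
        (f'--{boundary}\r\nContent-Disposition: form-data; '
         f'name="{file_field}"; filename="{filename}"\r\n'
         'Content-Type: application/octet-stream\r\n\r\n').encode()
        + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"

# Same fields as the curl example; placeholder bytes stand in for the real PDF.
body, content_type = build_multipart(
    {"backend": "pipeline", "lang_list": "ch",
     "parse_method": "auto", "return_md": "true"},
    "files", "document.pdf", b"%PDF-1.4 placeholder bytes")
```

POST `body` to `http://localhost:8000/file_parse` with `urllib.request.Request`, setting the returned `content_type` as the `Content-Type` header; if the `requests` library is available, `requests.post(url, files=..., data=...)` does the same with less code.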
The /api/v1/tasks/submit and /api/v1/tasks/{task_id} endpoints provide an asynchronous queue-based API, compatible with the mineru-tianshu project format.
Reference: mineru-tianshu API
Submit a Task:
```bash
curl -X POST "http://localhost:8000/api/v1/tasks/submit" \
  -F "file=@document.pdf" \
  -F "backend=pipeline" \
  -F "lang=ch"
```

Query Task Status:

```bash
curl "http://localhost:8000/api/v1/tasks/{task_id}"
```

Use cases: Production deployments, batch processing, long-running tasks, better scalability.
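With the queue-based API, a client submits and then polls until the task leaves the queue. A minimal polling sketch, assuming the status response is a JSON object with a `state` field whose terminal values are `completed` and `failed` (check the interactive docs for the actual schema):

```python
import time

def wait_for_task(fetch_status, task_id, poll_interval=2.0, timeout=300.0):
    """Poll a status callable until the task reaches a terminal state.

    fetch_status(task_id) should GET /api/v1/tasks/{task_id} and return the
    decoded JSON dict. The "state"/"completed"/"failed" names are assumptions.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status(task_id)
        if status.get("state") in ("completed", "failed"):
            return status
        time.sleep(poll_interval)
    raise TimeoutError(f"task {task_id} did not finish within {timeout}s")
```

Injecting the HTTP call as a callable keeps the retry/timeout logic independent of the HTTP client, so the same loop works with `urllib`, `requests`, or a test double.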
Visit http://localhost:8000/docs for interactive API documentation with full parameter details.
The most important configuration options (see .env.example for all options):
```bash
# Redis Configuration
REDIS_URL=redis://redis:6379/0

# Storage Type: local or s3
MINERU_STORAGE_TYPE=local

# For S3 storage (distributed deployment)
MINERU_S3_ENDPOINT=http://minio:9000
MINERU_S3_ACCESS_KEY=minioadmin
MINERU_S3_SECRET_KEY=minioadmin

# CORS Configuration (production)
CORS_ALLOWED_ORIGINS=http://localhost:3000
ENVIRONMENT=production

# File Upload Limits
MAX_FILE_SIZE=104857600  # 100MB
```

- 📖 Full Documentation - Complete guide and configuration (English | 中文)
- 🚀 Deployment Guide - Production deployment (中文)
- ⚙️ Configuration Reference - All configuration options (中文)
- 💡 API Examples - Code examples in multiple languages (中文)
- 🔧 Troubleshooting - Common issues and solutions (中文)
- 🧹 Storage & Cleanup - Storage configuration and cleanup (中文)
- API Service: Handles task submission and status queries (`api/app.py`)
- Worker Service: Processes documents using MinerU/MarkItDown (`worker/tasks.py`)
- Redis: Message queue and result storage
- Shared Config: Unified configuration in `shared/celeryconfig.py`
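The shared-config piece can be pictured as a plain Celery settings module. This is a hypothetical sketch, not the contents of `shared/celeryconfig.py`; the setting names are standard Celery options, and the Redis URL echoes the configuration example above:

```python
# Hypothetical sketch of a shared Celery settings module.
broker_url = "redis://redis:6379/0"     # message queue (matches REDIS_URL)
result_backend = "redis://redis:6379/0" # task results stored in Redis too
task_acks_late = True                   # re-deliver a task if its worker dies mid-run
worker_prefetch_multiplier = 1          # one long document job at a time per process
task_time_limit = 3600                  # hard per-task cap, in seconds
```

`task_acks_late` combined with a prefetch multiplier of 1 is the usual Celery recipe behind "task retry and fault recovery": an unacknowledged task goes back to the queue when a worker crashes.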
For detailed development environment setup instructions, see docs/DEVELOPMENT.md.
Quick Start:
```bash
# Use the automated setup script (recommended)
chmod +x setup_venv.sh
./setup_venv.sh

# Or manually:
python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install --upgrade pip setuptools wheel
pip install -r api/requirements.txt
pip install -r worker/requirements.txt
pip install -r cleanup/requirements.txt
```

We welcome contributions! Please see CONTRIBUTING.md for guidelines.
This project is built on top of the following excellent open-source projects:
- MinerU - The core document parsing engine that powers this service
- mineru-tianshu - Inspiration and reference for the API architecture
We are grateful to the developers and contributors of these projects for their valuable work.
MIT License - see LICENSE file for details.
This project uses the following open-source libraries:
MinerU is used as an external library and its source code is not included in this repository.