PDF to text extraction CLI tool using OCR.
```bash
# Build Docker image
docker build -t pdfocr .

# Process PDF
docker compose run --rm pdfocr /work/path/to/document.pdf
```

Output: `document.txt` in same directory.
- docs/SIMPLE_USAGE.md - Simplest usage guide
- docs/DOCKER_QUICKSTART.md - Docker quick reference
- docs/DOCKER_USAGE.md - Detailed Docker usage
- docs/QUICKSTART.md | 한국어 - Quick start guide
- docs/DOCKER.md | 한국어 - Docker deployment
- docs/ARCHITECTURE.md | 한국어 - Architecture overview
- docs/DEVELOPMENT.md | 한국어 - Development guide
- PDF to Image: Convert pages to PNG (pdf2image)
- OCR: Extract text (Tesseract OCR)
- Output: Save as UTF-8 text file
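
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the function names (`pdf_to_text`, `_render_pages`, `_ocr_page`) are hypothetical, and only `pdf2image.convert_from_path` and `pytesseract.image_to_string` are real library calls. The defaults (`eng+kor`, 300 DPI) mirror the documented CLI defaults.

```python
from pathlib import Path


def _render_pages(pdf_path: Path, dpi: int) -> list:
    """Step 1: render each PDF page to an in-memory PNG image."""
    # Imported lazily so this module loads even where pdf2image is missing.
    from pdf2image import convert_from_path

    return convert_from_path(str(pdf_path), dpi=dpi, fmt="png")


def _ocr_page(image, lang: str) -> str:
    """Step 2: run Tesseract OCR on a single page image."""
    import pytesseract

    return pytesseract.image_to_string(image, lang=lang)


def pdf_to_text(pdf_path, lang="eng+kor", dpi=300,
                render=_render_pages, ocr=_ocr_page) -> Path:
    """Run the pipeline and return the path of the written text file.

    `render` and `ocr` are injectable (hypothetical design choice) so the
    flow can be exercised without Poppler or Tesseract installed.
    """
    pdf_path = Path(pdf_path)
    pages = render(pdf_path, dpi)
    text = "\n".join(ocr(page, lang) for page in pages)

    # Step 3: save as UTF-8 next to the PDF (filename.pdf -> filename.txt).
    out_path = pdf_path.with_suffix(".txt")
    out_path.write_text(text, encoding="utf-8")
    return out_path
```

Calling `pdf_to_text("document.pdf")` with the real defaults requires the system and Python dependencies to be installed.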
```bash
chmod +x setup.sh
./setup.sh
```

Installs dependencies, creates virtual environment, installs packages.

```bash
./install.sh
```

Options: system-wide, user-local, or development mode.
```bash
# Build
docker build -t pdfocr .

# Run
docker compose run --rm pdfocr /work/document.pdf

# Custom output directory
docker compose run --rm pdfocr /work/document.pdf -o /work/output

# Multiple files
docker compose run --rm pdfocr /work/pdfs/*.pdf --merge

# Custom language (default: eng+kor)
docker compose run --rm pdfocr /work/document.pdf --lang kor

# Keep images for debugging
docker compose run --rm pdfocr /work/document.pdf --keep-images
```

After installation:
```bash
# Simple
pdfocr document.pdf

# Multiple files
pdfocr *.pdf --merge

# Custom output
pdfocr document.pdf -o ./output
```

```bash
# Keep images in a custom image directory
docker compose run --rm pdfocr /work/document.pdf --keep-images -i /work/images

# All options combined
docker compose run --rm pdfocr \
  /work/test/test_document.pdf \
  -o /work/test/output \
  --keep-images -i /work/images \
  --lang eng+kor --dpi 300
```
**Key Points**:
- 📁 Output saves to **same directory** as PDF by default
- 📄 Creates `filename.txt` from `filename.pdf`
- 🗂️ Use `/work/...` paths inside container
- 💾 All files persist on your host filesystem
See [Docker Documentation](docs/DOCKER.md) ([한국어](docs/DOCKER.ko.md)) for detailed usage.
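
The naming rule in the key points above (`filename.pdf` becomes `filename.txt`, in the same directory unless an output directory is given) can be illustrated with a small helper. This is a sketch under assumptions: `default_output_path` is a hypothetical name, not a function from the project, and POSIX paths are used because the examples refer to paths inside the container.

```python
from pathlib import PurePosixPath
from typing import Optional


def default_output_path(pdf_path: str, output_dir: Optional[str] = None) -> PurePosixPath:
    """Return where the extracted text for `pdf_path` would be written."""
    pdf = PurePosixPath(pdf_path)
    # Without an explicit output directory, fall back to the PDF's own directory.
    target_dir = PurePosixPath(output_dir) if output_dir else pdf.parent
    return target_dir / (pdf.stem + ".txt")


print(default_output_path("/work/path/to/document.pdf"))
# prints /work/path/to/document.txt
print(default_output_path("/work/document.pdf", "/work/output"))
# prints /work/output/document.txt
```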
### 4. Usage
#### Docker (simplest, no installation):
```bash
# Just specify the PDF - output auto-saves to same directory
docker compose run --rm pdfocr /work/path/to/document.pdf

# With options
docker compose run --rm pdfocr /work/document.pdf -o /work/output --lang eng+kor
```

#### After installation:

```bash
# Simple - creates lecture.txt in same directory
pdfocr ~/Documents/lecture.pdf

# Multiple files
pdfocr /path/to/*.pdf --merge

# Custom output directory
pdfocr document.pdf -o ./output
```

```
pdfocr/
├── src/pdfocr/              # Main package
│   ├── main.py              # Pipeline entry point
│   ├── pdf_to_image.py      # PDF converter
│   ├── image_to_text.py     # OCR module
│   ├── layout.py            # Layout analysis
│   ├── block_ocr.py         # Block-based OCR
│   └── types.py             # Type definitions
├── docs/                    # Documentation
│   ├── SIMPLE_USAGE.md
│   ├── DOCKER_QUICKSTART.md
│   ├── DOCKER_USAGE.md
│   ├── QUICKSTART.md / QUICKSTART.ko.md
│   ├── DOCKER.md / DOCKER.ko.md
│   ├── ARCHITECTURE.md / ARCHITECTURE.ko.md
│   └── DEVELOPMENT.md / DEVELOPMENT.ko.md
├── test/                    # Test files and scripts
├── pdfocr                   # CLI executable
├── main.py                  # CLI entry point
├── requirements.txt         # Python dependencies
├── setup.sh                 # Environment setup
├── install.sh               # CLI installation
├── Dockerfile               # Docker image definition
└── docker-compose.yml       # Docker compose config
```
- System: poppler-utils, tesseract-ocr
- Python: pdf2image, pytesseract, Pillow, opencv-python
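
For a non-Docker install, a quick pre-flight check can confirm that these dependencies are visible. The sketch below is illustrative and not part of the project: `missing_dependencies` is a hypothetical helper, and it assumes that poppler-utils provides the `pdftoppm` executable and tesseract-ocr provides `tesseract` on the `PATH`. Pillow and opencv-python are checked under their import names, `PIL` and `cv2`.

```python
import importlib.util
import shutil

# Assumption: the executable each system package is expected to put on PATH.
SYSTEM_TOOLS = {"pdftoppm": "poppler-utils", "tesseract": "tesseract-ocr"}

# Import name -> pip package name.
PYTHON_MODULES = {
    "pdf2image": "pdf2image",
    "pytesseract": "pytesseract",
    "PIL": "Pillow",
    "cv2": "opencv-python",
}


def missing_dependencies(which=shutil.which, find_spec=importlib.util.find_spec):
    """Return (missing system packages, missing Python packages)."""
    missing_system = sorted(
        pkg for tool, pkg in SYSTEM_TOOLS.items() if which(tool) is None
    )
    missing_python = sorted(
        pkg for module, pkg in PYTHON_MODULES.items() if find_spec(module) is None
    )
    return missing_system, missing_python


if __name__ == "__main__":
    system, python = missing_dependencies()
    print("Missing system packages:", ", ".join(system) or "none")
    print("Missing Python packages:", ", ".join(python) or "none")
```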
```
pdfocr [OPTIONS] <PDF_FILES...>

Options:
  -o, --output-dir DIR   Output directory (default: same as PDF)
  -i, --image-dir DIR    Temporary image directory
  -l, --lang LANG        OCR language (default: eng+kor)
  -d, --dpi DPI          Image resolution (default: 300)
  --keep-images          Keep temporary images
  --merge                Merge all outputs into one file
```
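
The option table above maps naturally onto Python's standard `argparse` module. The sketch below mirrors the documented flags and defaults. Whether the project actually uses `argparse`, and the name `build_parser`, are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented pdfocr options (illustrative)."""
    parser = argparse.ArgumentParser(prog="pdfocr")
    parser.add_argument("pdf_files", nargs="+", metavar="PDF_FILES",
                        help="one or more PDF files to process")
    parser.add_argument("-o", "--output-dir", metavar="DIR", default=None,
                        help="output directory (default: same as PDF)")
    parser.add_argument("-i", "--image-dir", metavar="DIR", default=None,
                        help="temporary image directory")
    parser.add_argument("-l", "--lang", metavar="LANG", default="eng+kor",
                        help="OCR language (default: eng+kor)")
    parser.add_argument("-d", "--dpi", metavar="DPI", type=int, default=300,
                        help="image resolution (default: 300)")
    parser.add_argument("--keep-images", action="store_true",
                        help="keep temporary images")
    parser.add_argument("--merge", action="store_true",
                        help="merge all outputs into one file")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["document.pdf", "-o", "./output", "--dpi", "600", "--keep-images"]
)
print(args.pdf_files, args.output_dir, args.dpi, args.keep_images, args.lang)
```

Options that are not passed keep their documented defaults, so `args.lang` is `eng+kor` and `args.merge` is `False` here.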
```bash
# Basic
pdfocr document.pdf

# Multiple files
pdfocr *.pdf --merge

# Custom options
pdfocr document.pdf -o ./output --dpi 600 --keep-images
```