pdfocr

A command-line tool that extracts text from PDF files using OCR.

Quick Start

# Build Docker image
docker build -t pdfocr .

# Process PDF
docker compose run --rm pdfocr /work/path/to/document.pdf

Output: document.txt, written to the same directory as the PDF.

Documentation

docs/SIMPLE_USAGE.md - Simplest usage guide
docs/DOCKER_QUICKSTART.md - Docker quick reference
docs/DOCKER_USAGE.md - Detailed Docker usage
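docs/QUICKSTART.md (Korean: docs/QUICKSTART.ko.md) - Quick start guide
docs/DOCKER.md (Korean: docs/DOCKER.ko.md) - Docker deployment
docs/ARCHITECTURE.md (Korean: docs/ARCHITECTURE.ko.md) - Architecture overview
docs/DEVELOPMENT.md (Korean: docs/DEVELOPMENT.ko.md) - Development guide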

Pipeline

PDF to Image: Convert pages to PNG (pdf2image)
OCR: Extract text (Tesseract OCR)
Output: Save as UTF-8 text file
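
The three stages above can be sketched in Python as follows. This is a minimal illustration, not the project's actual code: the function names (`render_pages`, `recognize_text`, `ocr_pdf`) are hypothetical, and it assumes pdf2image's `convert_from_path` and pytesseract's `image_to_string` are the calls behind the first two stages. The library imports are deferred so the stages can be swapped out for testing.

```python
from pathlib import Path


def render_pages(pdf_path, dpi=300):
    """Stage 1: render each PDF page to a PIL image (needs pdf2image + poppler)."""
    from pdf2image import convert_from_path  # deferred: optional dependency
    return convert_from_path(str(pdf_path), dpi=dpi, fmt="png")


def recognize_text(image, lang="eng+kor"):
    """Stage 2: run Tesseract on one page image (needs pytesseract + tesseract)."""
    from pytesseract import image_to_string  # deferred: optional dependency
    return image_to_string(image, lang=lang)


def ocr_pdf(pdf_path, dpi=300, lang="eng+kor",
            render=render_pages, recognize=recognize_text):
    """Run all three stages and return the path of the written text file."""
    pdf_path = Path(pdf_path)
    pages = render(pdf_path, dpi)
    text = "\n".join(recognize(page, lang) for page in pages)
    # Stage 3: save as UTF-8 next to the PDF (hypothetical naming: same stem, .txt)
    out_path = pdf_path.with_suffix(".txt")
    out_path.write_text(text, encoding="utf-8")
    return out_path
```

Calling `ocr_pdf("document.pdf")` with both libraries installed would write `document.txt` beside the input. The real pipeline also includes layout analysis and block-based OCR, which this sketch leaves out.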

Local Installation

Setup

chmod +x setup.sh
./setup.sh

The script installs dependencies, creates a virtual environment, and installs the Python packages.

Install CLI (Optional)

./install.sh

The installer offers three modes: system-wide, user-local, or development.

Docker Usage

Basic

# Build
docker build -t pdfocr .

# Run
docker compose run --rm pdfocr /work/document.pdf

With Options

# Custom output directory
docker compose run --rm pdfocr /work/document.pdf -o /work/output

# Multiple files
docker compose run --rm pdfocr /work/pdfs/*.pdf --merge

# Custom language (default: eng+kor)
docker compose run --rm pdfocr /work/document.pdf --lang kor

# Keep images for debugging
docker compose run --rm pdfocr /work/document.pdf --keep-images

CLI Usage

After installation:

# Simple
pdfocr document.pdf

# Multiple files
pdfocr *.pdf --merge

# Custom output
pdfocr document.pdf -o ./output

Keep images for debugging

docker compose run --rm pdfocr /work/document.pdf --keep-images -i /work/images

All options

docker compose run --rm pdfocr \
  /work/test/test_document.pdf \
  -o /work/test/output \
  --keep-images -i /work/images \
  --lang eng+kor --dpi 300


**Key Points**:
- 📁 Output saves to **same directory** as PDF by default
- 📄 Creates `filename.txt` from `filename.pdf`
- 🗂️ Use `/work/...` paths inside container
- 💾 All files persist on your host filesystem
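
The naming and path rules in the list above can be illustrated with a short Python sketch. Both helper names are hypothetical, and the host directory mounted at `/work` is an assumption: the actual mount is defined in `docker-compose.yml`, which is not shown here.

```python
from pathlib import PurePosixPath


def output_path_for(pdf_path, output_dir=None):
    """Return where the text output would go: filename.txt beside filename.pdf,
    or inside output_dir when one is given (mirrors the -o option)."""
    pdf = PurePosixPath(pdf_path)
    target_dir = PurePosixPath(output_dir) if output_dir else pdf.parent
    return target_dir / (pdf.stem + ".txt")


def to_container_path(host_path, mount_root, container_root="/work"):
    """Translate a host path into the path seen inside the container.

    Assumption: mount_root on the host is mounted at container_root.
    Raises ValueError if host_path is not under mount_root.
    """
    relative = PurePosixPath(host_path).relative_to(PurePosixPath(mount_root))
    return str(PurePosixPath(container_root) / relative)
```

For example, if `/home/me/project` were the mounted directory, `to_container_path("/home/me/project/pdfs/a.pdf", "/home/me/project")` would give the `/work/pdfs/a.pdf` path to pass on the command line.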

See [Docker Documentation](docs/DOCKER.md) ([한국어](docs/DOCKER.ko.md)) for detailed usage.

### Usage

#### Docker (simplest, no installation):

```bash
# Just specify the PDF; the output is saved to the same directory
docker compose run --rm pdfocr /work/path/to/document.pdf

# With options
docker compose run --rm pdfocr /work/document.pdf -o /work/output --lang eng+kor
```

#### After CLI installation:

```bash
# Simple: creates lecture.txt in the same directory
pdfocr ~/Documents/lecture.pdf

# Multiple files
pdfocr /path/to/*.pdf --merge

# Custom output directory
pdfocr document.pdf -o ./output
```

Project Structure

pdfocr/
├── src/pdfocr/          # Main package
│   ├── main.py          # Pipeline entry point
│   ├── pdf_to_image.py  # PDF converter
│   ├── image_to_text.py # OCR module
│   ├── layout.py        # Layout analysis
│   ├── block_ocr.py     # Block-based OCR
│   └── types.py         # Type definitions
├── docs/                # Documentation
│   ├── SIMPLE_USAGE.md
│   ├── DOCKER_QUICKSTART.md
│   ├── DOCKER_USAGE.md
│   ├── QUICKSTART.md / QUICKSTART.ko.md
│   ├── DOCKER.md / DOCKER.ko.md
│   ├── ARCHITECTURE.md / ARCHITECTURE.ko.md
│   └── DEVELOPMENT.md / DEVELOPMENT.ko.md
├── test/                # Test files and scripts
├── pdfocr               # CLI executable
├── main.py              # CLI entry point
├── requirements.txt     # Python dependencies
├── setup.sh             # Environment setup
├── install.sh           # CLI installation
├── Dockerfile           # Docker image definition
└── docker-compose.yml   # Docker compose config

Requirements

System: poppler-utils, tesseract-ocr
Python: pdf2image, pytesseract, Pillow, opencv-python
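
Before a local run, it can help to confirm that the system packages are on the PATH. The sketch below is an illustration rather than part of the project. It relies on poppler-utils providing the `pdftoppm` executable and tesseract-ocr providing `tesseract`; the function name is hypothetical.

```python
import shutil

# System package -> executable it is expected to provide
REQUIRED_TOOLS = {
    "poppler-utils": "pdftoppm",
    "tesseract-ocr": "tesseract",
}


def missing_system_packages(which=shutil.which):
    """Return the names of required system packages whose executable is not on PATH."""
    return [pkg for pkg, exe in REQUIRED_TOOLS.items() if which(exe) is None]


if __name__ == "__main__":
    missing = missing_system_packages()
    if missing:
        print("Missing system packages:", ", ".join(missing))
    else:
        print("All system packages found.")
```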

CLI Options

pdfocr [OPTIONS] <PDF_FILES...>

Options:
  -o, --output-dir DIR  Output directory (default: same as PDF)
  -i, --image-dir DIR   Temporary image directory
  -l, --lang LANG       OCR language (default: eng+kor)
  -d, --dpi DPI         Image resolution (default: 300)
  --keep-images         Keep temporary images
  --merge               Merge all outputs into one file
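
For readers who want to see how such an interface maps onto code, here is a minimal `argparse` sketch that mirrors the options listed above. It is an assumption about how a parser for this interface could look, not the project's actual `main.py`; `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented pdfocr options."""
    parser = argparse.ArgumentParser(
        prog="pdfocr", description="Extract text from PDF files using OCR."
    )
    parser.add_argument("pdf_files", nargs="+", metavar="PDF_FILES",
                        help="one or more PDF files to process")
    parser.add_argument("-o", "--output-dir", metavar="DIR", default=None,
                        help="output directory (default: same as PDF)")
    parser.add_argument("-i", "--image-dir", metavar="DIR", default=None,
                        help="temporary image directory")
    parser.add_argument("-l", "--lang", default="eng+kor",
                        help="OCR language (default: eng+kor)")
    parser.add_argument("-d", "--dpi", type=int, default=300,
                        help="image resolution (default: 300)")
    parser.add_argument("--keep-images", action="store_true",
                        help="keep temporary images")
    parser.add_argument("--merge", action="store_true",
                        help="merge all outputs into one file")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```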

Examples

# Basic
pdfocr document.pdf

# Multiple files
pdfocr *.pdf --merge

# Custom options
pdfocr document.pdf -o ./output --dpi 600 --keep-images
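
The `--merge` examples combine the text from several PDFs into one file. The README does not specify the merged layout, so the sketch below is only one possible approach: it concatenates per-document text files under a header line per file. The function name and the header format are assumptions.

```python
from pathlib import Path


def merge_text_files(text_paths, merged_path):
    """Concatenate text files into merged_path, one labeled section per file."""
    merged_path = Path(merged_path)
    sections = []
    for path in map(Path, text_paths):
        body = path.read_text(encoding="utf-8").rstrip("\n")
        # Hypothetical separator: the file name between "=====" markers
        sections.append(f"===== {path.name} =====\n{body}\n")
    merged_path.write_text("\n".join(sections), encoding="utf-8")
    return merged_path
```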

Repository: zeetee1235/pdfocr
