PDF to Text Converter

A Python library and command-line tool for converting PDF files to text files using PyMuPDF.

Features

Convert individual PDF files to text
Batch convert multiple PDF files in a directory
Preserve directory structure in output
Parallel processing for improved performance
Error handling for corrupted or unreadable PDF files
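
As an illustration of what preserving directory structure means, the sketch below maps each PDF under a source directory to a `.txt` path at the same relative location under the output directory. The function names `map_output_path` and `plan_outputs` are hypothetical and are not part of the library; this only shows one plausible way to do the mapping.

```python
from pathlib import Path


def map_output_path(pdf_path: Path, source_dir: Path, output_dir: Path) -> Path:
    """Return the .txt path under output_dir mirroring pdf_path's place under source_dir."""
    # Hypothetical helper: keep the sub-directory part, swap the extension.
    relative = pdf_path.relative_to(source_dir)
    return (output_dir / relative).with_suffix(".txt")


def plan_outputs(source_dir: Path, output_dir: Path) -> dict[Path, Path]:
    """Map every PDF found recursively under source_dir to its mirrored output path."""
    return {
        pdf: map_output_path(pdf, source_dir, output_dir)
        for pdf in sorted(source_dir.rglob("*.pdf"))
    }
```

With this mapping, `pdfs/reports/2023/q1.pdf` would be written to `texts/reports/2023/q1.txt`.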

Installation

Clone the repository:
```
git clone <repository-url>
cd pdf2text
```

Install dependencies:

```
pip install -r requirements.txt
```

Or if using pipenv:

```
pipenv install
```

Usage

Command-Line Interface

Convert all PDF files in a directory:

```
python -m pdf2text.cli batch <source_directory> [<output_directory>]
```

Examples:

```
# Convert all PDFs in 'pdfs' directory to text files in 'texts' directory
python -m pdf2text.cli batch pdfs texts

# Convert all PDFs in current directory to text files in 'output' directory
python -m pdf2text.cli batch . output

# Convert with custom number of worker processes
python -m pdf2text.cli batch pdfs texts --workers 8
```

Options:

--workers N: Number of worker processes to use for parallel processing (default: 4)
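
For readers curious how a command line of this shape can be parsed, here is a minimal `argparse` sketch that accepts the same arguments. It is not the actual contents of `cli.py`. The `build_parser` name and the `None` placeholder for an omitted output directory are assumptions, since the real default is not stated here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for: batch <source_directory> [<output_directory>] [--workers N]."""
    parser = argparse.ArgumentParser(prog="pdf2text")
    subparsers = parser.add_subparsers(dest="command", required=True)

    batch = subparsers.add_parser("batch", help="Convert every PDF in a directory")
    batch.add_argument("source_directory")
    # Assumption: None stands in for whatever default the real CLI applies.
    batch.add_argument("output_directory", nargs="?", default=None)
    batch.add_argument("--workers", type=int, default=4,
                       help="Number of worker processes (default: 4)")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["batch", "pdfs", "texts", "--workers", "8"])
print(args.source_directory, args.output_directory, args.workers)
```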

Library Usage

You can also use the library directly in your Python code:

```
from pdf2text.batch_converter import process_batch_conversion

# Convert all PDF files in a directory
successful, failed, errors = process_batch_conversion(
    source_dir="path/to/pdf/files",
    output_dir="path/to/output/text/files",
    max_workers=4  # Optional, defaults to 4
)

print(f"Successfully converted: {successful} files")
if failed > 0:
    print(f"Failed to convert: {failed} files")
    for error in errors:
        print(f"  - {error}")
```
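
The `(successful, failed, errors)` return shape suggests a pattern where each file is converted in its own task and failures are collected rather than raised. The sketch below shows one way that could look with `concurrent.futures`. It is not the library's implementation: `run_batch`, `convert_one`, and `executor_cls` are hypothetical names. With the default `ProcessPoolExecutor`, `convert_one` must be a top-level (picklable) function.

```python
from concurrent.futures import Executor, ProcessPoolExecutor, as_completed
from pathlib import Path
from typing import Callable


def run_batch(
    pdf_paths: list[Path],
    convert_one: Callable[[Path], str],
    max_workers: int = 4,
    executor_cls: type[Executor] = ProcessPoolExecutor,
) -> tuple[int, int, list[str]]:
    """Run convert_one over pdf_paths in parallel; return (successful, failed, errors)."""
    successful, failed, errors = 0, 0, []
    with executor_cls(max_workers=max_workers) as pool:
        # Remember which path each future belongs to so errors can name the file.
        futures = {pool.submit(convert_one, path): path for path in pdf_paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                future.result()
                successful += 1
            except Exception as exc:
                # One unreadable PDF should not abort the rest of the batch.
                failed += 1
                errors.append(f"{path}: {exc}")
    return successful, failed, errors
```

Passing `executor_cls=ThreadPoolExecutor` swaps processes for threads, which can be handy when experimenting in an interactive session.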

For converting a single PDF file:

```
from pdf2text.batch_converter import convert_pdf_to_text

# Convert a single PDF file to text
text = convert_pdf_to_text("path/to/document.pdf")
print(text)
```
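
For context, extracting text from a single file with PyMuPDF typically comes down to opening the document and joining the text of each page. The sketch below is an assumption about how such a function might look, not the body of `convert_pdf_to_text`. The `extract_text` name and the `open_document` parameter are hypothetical. The parameter exists only so the function can be tried without a real PDF.

```python
from typing import Any, Callable, Optional


def extract_text(pdf_path: str,
                 open_document: Optional[Callable[[str], Any]] = None) -> str:
    """Return the text of every page in pdf_path, one page per line group."""
    if open_document is None:
        # Imported lazily so the module loads even where PyMuPDF is absent.
        import fitz  # PyMuPDF
        open_document = fitz.open
    # PyMuPDF documents work as context managers and iterate over their pages.
    with open_document(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)
```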

Main Entry Point

The main.py script provides a simple way to convert PDF files:

```
python main.py
```

This will convert all PDF files in the 'pdf' directory to text files in the 'text' directory.

Project Structure

```
pdf2text/
├── pdf2text/           # Main package
│   ├── __init__.py
│   ├── batch_converter.py  # Core conversion logic
│   └── cli.py          # Command-line interface
├── main.py             # Main entry point
├── pdf/                # Default input directory
├── text/               # Default output directory
├── tests/              # Unit tests
└── specs/              # Project specifications
```

Testing

Run the unit tests:

```
python -m pytest tests/ -v
```

License

MIT License
