
Code-Yay-Mal/pdf2text


PDF to Text Converter

A Python library and command-line tool for converting PDF files to text files using PyMuPDF.

Features

  • Convert individual PDF files to text
  • Batch convert multiple PDF files in a directory
  • Preserve directory structure in output
  • Parallel processing for improved performance
  • Error handling for corrupted or unreadable PDF files
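As a rough illustration of what "preserve directory structure" can mean, the sketch below maps each PDF to a `.txt` file at the same relative location under the output directory. The helper name `mirror_output_path` is hypothetical and not part of this package; the actual logic in `batch_converter.py` may differ.

```python
from pathlib import Path


def mirror_output_path(pdf_path, source_dir, output_dir):
    """Hypothetical helper: place the .txt file at the same relative
    location under output_dir that the PDF has under source_dir."""
    # Path of the PDF relative to the source root, e.g. reports/2023/q1.pdf
    relative = Path(pdf_path).relative_to(source_dir)
    # Swap the extension and re-root the path under the output directory
    return Path(output_dir) / relative.with_suffix(".txt")


target = mirror_output_path("pdfs/reports/2023/q1.pdf", "pdfs", "texts")
print(target.as_posix())  # texts/reports/2023/q1.txt
```

A caller would typically create `target.parent` with `mkdir(parents=True, exist_ok=True)` before writing the text file.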

Installation

  1. Clone the repository:

    git clone <repository-url>
    cd pdf2text

  2. Install dependencies:

    pip install -r requirements.txt

    Or, if you use pipenv:

    pipenv install

Usage

Command-Line Interface

Convert all PDF files in a directory:

python -m pdf2text.cli batch <source_directory> [<output_directory>]

Examples:

# Convert all PDFs in 'pdfs' directory to text files in 'texts' directory
python -m pdf2text.cli batch pdfs texts

# Convert all PDFs in current directory to text files in 'output' directory
python -m pdf2text.cli batch . output

# Convert with custom number of worker processes
python -m pdf2text.cli batch pdfs texts --workers 8

Options:

  • --workers N: Number of worker processes to use for parallel processing (default: 4)
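For readers curious how a command line of this shape is usually parsed, here is a minimal `argparse` sketch that accepts the documented `batch` subcommand, the optional output directory, and `--workers` with a default of 4. It is an assumption about the structure, not a copy of `cli.py`; the function name `build_parser` and the `None` default for a missing output directory are illustrative only.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented CLI shape."""
    parser = argparse.ArgumentParser(prog="pdf2text")
    subparsers = parser.add_subparsers(dest="command", required=True)

    batch = subparsers.add_parser("batch", help="Convert every PDF in a directory")
    batch.add_argument("source_directory")
    # The output directory is optional; None here is a placeholder default
    batch.add_argument("output_directory", nargs="?", default=None)
    batch.add_argument("--workers", type=int, default=4,
                       help="Number of worker processes (default: 4)")
    return parser


args = build_parser().parse_args(["batch", "pdfs", "texts", "--workers", "8"])
print(args.source_directory, args.output_directory, args.workers)  # pdfs texts 8
```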

Library Usage

You can also use the library directly in your Python code:

from pdf2text.batch_converter import process_batch_conversion

# Convert all PDF files in a directory
successful, failed, errors = process_batch_conversion(
    source_dir="path/to/pdf/files",
    output_dir="path/to/output/text/files",
    max_workers=4  # Optional, defaults to 4
)

print(f"Successfully converted: {successful} files")
if failed > 0:
    print(f"Failed to convert: {failed} files")
    for error in errors:
        print(f"  - {error}")
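The `(successful, failed, errors)` return value suggests that one unreadable PDF does not abort the whole batch. The sketch below shows one common way to get that behavior with `concurrent.futures`. It is not the package's implementation: `run_batch` and `fake_convert` are hypothetical names, and a thread pool is used here only to keep the example short, whereas the README describes worker processes.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_batch(paths, convert, max_workers=4):
    """Hypothetical sketch: run convert() on every path in parallel and
    collect failures instead of stopping at the first one."""
    successful, failed, errors = 0, 0, []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember which path each future belongs to for error reporting
        futures = {pool.submit(convert, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                future.result()  # re-raises any exception from convert()
                successful += 1
            except Exception as exc:
                failed += 1
                errors.append(f"{path}: {exc}")
    return successful, failed, errors


def fake_convert(path):
    """Stand-in for a real converter; fails on names containing 'broken'."""
    if "broken" in path:
        raise ValueError("cannot open file")
    return "text"


print(run_batch(["a.pdf", "broken.pdf", "c.pdf"], fake_convert))
# (2, 1, ['broken.pdf: cannot open file'])
```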

For converting a single PDF file:

from pdf2text.batch_converter import convert_pdf_to_text

# Convert a single PDF file to text
text = convert_pdf_to_text("path/to/document.pdf")
print(text)

Main Entry Point

The main.py script runs a batch conversion without any arguments:

python main.py

This converts every PDF file in the 'pdf' directory to a text file in the 'text' directory.

Project Structure

pdf2text/
├── pdf2text/           # Main package
│   ├── __init__.py
│   ├── batch_converter.py  # Core conversion logic
│   └── cli.py          # Command-line interface
├── main.py             # Main entry point
├── pdf/                # Default input directory
├── text/               # Default output directory
├── tests/              # Unit tests
└── specs/              # Project specifications

Testing

Run the unit tests:

python -m pytest tests/ -v

License

MIT License
