A Python library and command-line tool for converting PDF files to text files using PyMuPDF.
- Convert individual PDF files to text
- Batch convert multiple PDF files in a directory
- Preserve directory structure in output
- Parallel processing for improved performance
- Error handling for corrupted or unreadable PDF files
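The last two features, parallel processing and per-file error handling, can be illustrated with a short sketch. This is not the library's implementation: `run_batch` and `fake_convert` are hypothetical names, and the sketch defaults to a thread pool rather than worker processes so that it stays self-contained. The point is only that one failing file is counted and reported instead of aborting the batch.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_batch(paths, convert, max_workers=4, executor_cls=ThreadPoolExecutor):
    """Run convert(path) for every path and tally the outcomes.

    Hypothetical helper, not part of pdf2text. Returns a tuple
    (successful, failed, errors), where errors is a list of messages.
    """
    successful, failed, errors = 0, 0, []
    with executor_cls(max_workers=max_workers) as pool:
        # Map each future back to its path so failures can be reported by name.
        futures = {pool.submit(convert, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                future.result()  # re-raises any exception from the worker
                successful += 1
            except Exception as exc:  # one bad file must not stop the batch
                failed += 1
                errors.append(f"{path}: {exc}")
    return successful, failed, errors


def fake_convert(path):
    """Stand-in for a real PDF conversion; fails on names containing 'corrupt'."""
    if "corrupt" in path:
        raise ValueError("cannot open file")
    return f"text of {path}"


if __name__ == "__main__":
    print(run_batch(["a.pdf", "corrupt.pdf", "b.pdf"], fake_convert))
```

Passing `executor_cls=ProcessPoolExecutor` would switch the same loop to worker processes, provided the conversion function is defined at module level so it can be pickled.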
Clone the repository:
git clone <repository-url>
cd pdf2text
Install dependencies:
pip install -r requirements.txt
Or if using pipenv:
pipenv install
Convert all PDF files in a directory:
python -m pdf2text.cli batch <source_directory> [<output_directory>]
Examples:
# Convert all PDFs in 'pdfs' directory to text files in 'texts' directory
python -m pdf2text.cli batch pdfs texts
# Convert all PDFs in current directory to text files in 'output' directory
python -m pdf2text.cli batch . output
# Convert with custom number of worker processes
python -m pdf2text.cli batch pdfs texts --workers 8
Options:
--workers N: Number of worker processes to use for parallel processing (default: 4)
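For readers curious how a command line of this shape can be declared, here is a minimal `argparse` sketch. The parser layout is an assumption made for illustration and is not taken from `pdf2text/cli.py`; in particular, the default for the optional output directory is not documented here, so the sketch leaves it as `None`.

```python
import argparse


def build_parser():
    """Build a hypothetical parser mirroring the documented `batch` usage."""
    parser = argparse.ArgumentParser(prog="pdf2text")
    subcommands = parser.add_subparsers(dest="command", required=True)

    batch = subcommands.add_parser("batch", help="convert every PDF in a directory")
    batch.add_argument("source_directory")
    # The output directory is optional in the usage line; its real default
    # is not stated in this README, so the sketch uses None.
    batch.add_argument("output_directory", nargs="?", default=None)
    batch.add_argument("--workers", type=int, default=4,
                       help="number of worker processes (default: 4)")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["batch", "pdfs", "texts", "--workers", "8"])
    print(args.source_directory, args.output_directory, args.workers)
```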
You can also use the library directly in your Python code:
from pdf2text.batch_converter import process_batch_conversion
# Convert all PDF files in a directory
successful, failed, errors = process_batch_conversion(
    source_dir="path/to/pdf/files",
    output_dir="path/to/output/text/files",
    max_workers=4,  # Optional, defaults to 4
)
print(f"Successfully converted: {successful} files")
if failed > 0:
    print(f"Failed to convert: {failed} files")
    for error in errors:
        print(f" - {error}")
For converting a single PDF file:
from pdf2text.batch_converter import convert_pdf_to_text
# Convert a single PDF file to text
text = convert_pdf_to_text("path/to/document.pdf")
print(text)
The main.py script provides a simple way to convert PDF files:
python main.py
This will convert all PDF files in the 'pdf' directory to text files in the 'text' directory.
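Because the tool preserves the directory structure in its output, each PDF under the input directory needs a matching path under the output directory. The following is a minimal sketch of one way to compute that mapping with `pathlib`; `output_path_for` and `plan_outputs` are hypothetical helpers, not functions exported by pdf2text.

```python
from pathlib import Path


def output_path_for(pdf_path, source_dir, output_dir):
    """Map source_dir/sub/file.pdf to output_dir/sub/file.txt.

    Hypothetical helper, not part of pdf2text.
    """
    # Keep only the part of the path below the source directory.
    relative = Path(pdf_path).relative_to(source_dir)
    return Path(output_dir) / relative.with_suffix(".txt")


def plan_outputs(source_dir, output_dir):
    """Pair every PDF found under source_dir with its mirrored output path."""
    source = Path(source_dir)
    return [
        (pdf, output_path_for(pdf, source, output_dir))
        for pdf in sorted(source.rglob("*.pdf"))
    ]


if __name__ == "__main__":
    print(output_path_for("pdf/reports/2023/q1.pdf", "pdf", "text"))
```

A real converter would also need to create the parent directory of each output path before writing, for example with `Path.mkdir(parents=True, exist_ok=True)`.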
pdf2text/
├── pdf2text/ # Main package
│ ├── __init__.py
│ ├── batch_converter.py # Core conversion logic
│ └── cli.py # Command-line interface
├── main.py # Main entry point
├── pdf/ # Default input directory
├── text/ # Default output directory
├── tests/ # Unit tests
└── specs/ # Project specifications
Run the unit tests:
python -m pytest tests/ -v
MIT License