sjohn2/pdf-chapter-splitter

PDF Chapter Splitter

A Python tool that splits PDF textbooks into individual chapter files based on the table of contents. Useful for students, researchers, and anyone who needs to work with specific chapters from large PDF documents.

🚀 Features

  • Smart OCR Integration: Automatically detects and extracts text from scanned PDFs using OCR
  • Footer Page Verification: Uses PDF footer page numbers to ensure accurate chapter extraction
  • Multi-format Support: Handles various table of contents formats and layouts
  • Batch Processing: Processes entire PDFs and creates individual chapter files
  • Zip Archive Creation: Automatically creates a zip file containing all chapters
  • Chapter Numbering: Maintains original chapter numbers in filenames
  • Complete Content: Extracts full chapter content including sub-sections and summaries

📋 Requirements

  • Python 3.8 or higher
  • Tesseract OCR engine
  • Virtual environment (recommended)

🛠️ Installation

1. Clone the Repository

git clone https://github.com/[YOUR_USERNAME]/pdf-splitter.git
cd pdf-splitter

2. Install Tesseract OCR

On macOS:

brew install tesseract

On Ubuntu/Debian:

sudo apt-get install tesseract-ocr

On Windows: Download and install from Tesseract GitHub

3. Set Up Python Environment

# Create virtual environment
python -m venv venv

# Activate virtual environment
# On macOS/Linux:
source venv/bin/activate
# On Windows:
venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

📖 Usage

Basic Usage

python main.py "path/to/your/textbook.pdf"

Example

python main.py "Cambridge IGCSE Biology Workbook.pdf"

Output

The script will:

  1. Analyze the PDF and extract chapter information from the table of contents
  2. Create a folder named {pdf_name}_chapters
  3. Generate individual PDF files for each chapter
  4. Create a zip archive containing all chapters
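
Steps 2 and 4 can be sketched with the standard library alone. This is a minimal illustration, not the code in main.py: the helper names `chapter_dir_for` and `zip_chapter_folder` are hypothetical, and placing the output folder next to the input PDF is an assumption.

```python
import zipfile
from pathlib import Path


def chapter_dir_for(pdf_path: Path) -> Path:
    """Return the {pdf_name}_chapters folder (assumed to sit next to the input PDF)."""
    return pdf_path.parent / f"{pdf_path.stem}_chapters"


def zip_chapter_folder(chapter_dir: Path) -> Path:
    """Zip every PDF in chapter_dir into a sibling archive named <folder>.zip."""
    archive_path = chapter_dir.parent / (chapter_dir.name + ".zip")
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for pdf in sorted(chapter_dir.glob("*.pdf")):
            # Store files under the folder name so the archive unpacks into one directory.
            archive.write(pdf, arcname=f"{chapter_dir.name}/{pdf.name}")
    return archive_path
```

Sorting the files keeps the archive listing in chapter order, as long as the filenames use zero-padded chapter numbers.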

Output Structure

your_textbook_chapters/
├── Chapter_01_Introduction.pdf
├── Chapter_02_Basic_Concepts.pdf
├── Chapter_03_Advanced_Topics.pdf
└── ...
your_textbook_chapters.zip
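
Filenames in this shape can be produced from a chapter number and title. The following is a sketch under assumptions: `chapter_filename` is a hypothetical helper, and the rule for sanitizing titles (non-alphanumeric runs become a single underscore) is a guess based on the names listed above, not taken from main.py.

```python
import re


def chapter_filename(number: int, title: str) -> str:
    """Build a name such as Chapter_01_Introduction.pdf from a chapter number and title."""
    # Replace each run of characters that are not letters or digits with one underscore.
    safe_title = re.sub(r"[^A-Za-z0-9]+", "_", title).strip("_")
    # Zero-pad the number to two digits so files sort in chapter order.
    return f"Chapter_{number:02d}_{safe_title}.pdf"
```

For example, `chapter_filename(2, "Basic Concepts")` gives `Chapter_02_Basic_Concepts.pdf`.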

🔧 How It Works

1. Content Detection

  • Scans specified pages (default: page 4) for the table of contents
  • Uses OCR to extract text from scanned PDFs
  • Parses chapter titles and page numbers
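
To illustrate the parsing step, here is a minimal sketch that turns extracted contents text into chapter entries. The line shape it accepts (a chapter number, a title, optional dot leaders, then a page number) is an assumption, and `TocEntry`, `TOC_LINE`, and `parse_toc` are hypothetical names; the patterns main.py actually recognizes may differ.

```python
import re
from typing import List, NamedTuple


class TocEntry(NamedTuple):
    number: int  # chapter number as printed in the contents
    title: str   # chapter title without dot leaders
    page: int    # printed page number where the chapter starts


# Assumed shape: optional "Chapter", a number, a title, dot leaders or spaces, a page number.
TOC_LINE = re.compile(
    r"^\s*(?:chapter\s+)?(\d+)[.:]?\s+(.+?)[\s.]+(\d+)\s*$", re.IGNORECASE
)


def parse_toc(text: str) -> List[TocEntry]:
    """Return one TocEntry per line that looks like a chapter entry; skip everything else."""
    entries = []
    for line in text.splitlines():
        match = TOC_LINE.match(line)
        if match:
            number, title, page = match.groups()
            entries.append(TocEntry(int(number), title.strip(), int(page)))
    return entries
```

Lines that do not fit the pattern, such as a "Contents" heading or a "Glossary" entry without a chapter number, are ignored rather than raising an error, which tends to be the safer choice for noisy OCR output.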

2. Page Index Mapping

  • Reads actual page numbers from PDF footers
  • Maps chapter page numbers to correct PDF indices
  • Ensures accurate content extraction
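
The mapping idea can be sketched as follows, assuming the footer text of each PDF page has already been extracted into a list (for instance with a clipped `get_text` call like the one in the Verify Chapter Content snippet). The function names are hypothetical, and treating the last integer in the footer as the printed page number is a simplifying assumption.

```python
import re
from typing import Dict, List, Optional


def footer_page_number(footer_text: str) -> Optional[int]:
    """Return the last standalone integer in a footer string, or None if there is none."""
    numbers = re.findall(r"\b\d+\b", footer_text)
    return int(numbers[-1]) if numbers else None


def build_page_map(footers: List[str]) -> Dict[int, int]:
    """Map printed page number -> zero-based PDF index, keeping the first occurrence."""
    page_map: Dict[int, int] = {}
    for index, text in enumerate(footers):
        printed = footer_page_number(text)
        if printed is not None and printed not in page_map:
            page_map[printed] = index
    return page_map
```

With such a map, a chapter listed as starting on printed page 1 resolves to whichever PDF index carries "1" in its footer, which accounts for unnumbered front matter such as the cover and contents pages.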

3. Chapter Extraction

  • Creates individual PDF files for each chapter
  • Maintains original page order and content
  • Includes all sub-sections and summaries
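
One way to decide which pages belong to each chapter is to let every chapter run until the page before the next one starts. The sketch below assumes the chapter start positions are already known as zero-based PDF indices; `chapter_ranges` is a hypothetical helper, not a function from main.py.

```python
from typing import List, Tuple


def chapter_ranges(start_indices: List[int], page_count: int) -> List[Tuple[int, int]]:
    """Return inclusive (first, last) PDF index ranges, one per chapter."""
    ranges = []
    for position, first in enumerate(start_indices):
        if position + 1 < len(start_indices):
            # A chapter ends on the page before the next chapter starts.
            last = start_indices[position + 1] - 1
        else:
            # The final chapter runs to the end of the document.
            last = page_count - 1
        ranges.append((first, last))
    return ranges
```

Each range could then be copied into a new file with PyMuPDF, for example `new_doc.insert_pdf(doc, from_page=first, to_page=last)`. Note that under this rule the final chapter also picks up any back matter, such as a glossary or index, that follows it.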

📁 Project Structure

pdf-splitter/
├── main.py                 # Main script
├── requirements.txt        # Python dependencies
├── README.md               # This file
├── LICENSE                 # MIT License
├── .gitignore              # Git ignore file
├── CONTRIBUTING.md         # Contribution guidelines
├── CHANGELOG.md            # Version history
└── examples/               # Example PDFs and outputs
    ├── sample_input.pdf
    └── sample_output/

🧪 Testing

Test with Sample PDF

# Run the script on a test PDF
python main.py "examples/sample_input.pdf"

# Verify output
ls examples/sample_output/

Verify Chapter Content

# Check that chapter PDFs have correct page numbers
python -c "
import fitz
doc = fitz.open('output/Chapter_01_Example.pdf')
print(f'Pages: {doc.page_count}')
print(f'Footer page numbers: {[doc[i].get_text(\"text\", clip=fitz.Rect(0, doc[i].rect.height*0.9, doc[i].rect.width, doc[i].rect.height))[:10] for i in range(min(3, doc.page_count))]}')
"

🔍 Troubleshooting

Common Issues

1. Tesseract not found

# Ensure Tesseract is installed and in PATH
tesseract --version
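
The same check can be made from Python before any OCR work starts. This is a small sketch using only the standard library; `tesseract_available` is a hypothetical helper and is not part of main.py.

```python
import shutil


def tesseract_available(command: str = "tesseract") -> bool:
    """Return True if the given executable can be found on PATH."""
    return shutil.which(command) is not None


if __name__ == "__main__":
    if not tesseract_available():
        print("Tesseract was not found on PATH; install it or add it to PATH.")
```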

2. OCR not working properly

  • Check that the PDF has readable text or clear scanned images
  • Ensure good image quality for scanned PDFs

3. Wrong chapter detection

  • Verify the table of contents is on the expected page (default: page 4)
  • Check that chapter titles follow the expected format

4. Missing chapters

  • The script looks for the table of contents on page 4 by default
  • If the table of contents in your PDF is on a different page, modify the content_pages parameter

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for details.

Development Setup

# Fork and clone the repository
git clone https://github.com/[YOUR_USERNAME]/pdf-splitter.git
cd pdf-splitter

# Create virtual environment
python -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Install development dependencies
pip install pytest black flake8

Running Tests

# Run tests
pytest tests/

# Format code
black .

# Lint code
flake8 .

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

📊 Statistics

  • Supported Formats: PDF with text or scanned images
  • Chapter Detection: Automatic from table of contents
  • Accuracy: Footer-verified page extraction
  • Output: Individual chapter PDFs + zip archive

🔄 Version History

See CHANGELOG.md for a complete version history.

📞 Support


Made with ❤️ for the open source community
