A powerful Python tool that automatically splits PDF textbooks into individual chapter files based on the table of contents. Perfect for students, researchers, and anyone who needs to work with specific chapters from large PDF documents.
- Smart OCR Integration: Automatically detects and extracts text from scanned PDFs using OCR
- Footer Page Verification: Uses PDF footer page numbers to ensure accurate chapter extraction
- Multi-format Support: Handles various table of contents formats and layouts
- Batch Processing: Processes entire PDFs and creates individual chapter files
- Zip Archive Creation: Automatically creates a zip file containing all chapters
- Chapter Numbering: Maintains original chapter numbers in filenames
- Complete Content: Extracts full chapter content including sub-sections and summaries
- Python 3.8 or higher
- Tesseract OCR engine
- Virtual environment (recommended)
git clone https://github.com/[YOUR_USERNAME]/pdf-splitter.git
cd pdf-splitter
On macOS:
brew install tesseract
On Ubuntu/Debian:
sudo apt-get install tesseract-ocr
On Windows: Download and install from Tesseract GitHub
# Create virtual environment
python -m venv venv
# Activate virtual environment
# On macOS/Linux:
source venv/bin/activate
# On Windows:
venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
python main.py "path/to/your/textbook.pdf"
python main.py "Cambridge IGCSE Biology Workbook.pdf"
The script will:
- Analyze the PDF and extract chapter information from the table of contents
- Create a folder named {pdf_name}_chapters
- Generate individual PDF files for each chapter
- Create a zip archive containing all chapters
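The chapter-writing step listed above could look roughly like the following. This is a minimal sketch, not the code in main.py: `chapter_filename` and `write_chapters` are hypothetical names, and the page ranges are assumed to be 0-based, inclusive PDF indices that have already been worked out. It uses PyMuPDF's `insert_pdf` to copy a page range into a new document.

```python
import re
from pathlib import Path
from typing import List, Tuple


def chapter_filename(number: int, title: str) -> str:
    """Build a name like Chapter_01_Introduction.pdf (hypothetical helper)."""
    # Collapse anything that is not a letter or digit into a single underscore.
    safe = re.sub(r"[^A-Za-z0-9]+", "_", title).strip("_")
    return f"Chapter_{number:02d}_{safe}.pdf"


def write_chapters(
    pdf_path: str,
    chapters: List[Tuple[int, str, int, int]],
    out_dir: str,
) -> List[Path]:
    """Write one PDF per (number, title, first_index, last_index) entry."""
    import fitz  # PyMuPDF; imported here so chapter_filename works without it

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    src = fitz.open(pdf_path)
    try:
        for number, title, first, last in chapters:
            chapter = fitz.open()  # new, empty PDF
            # Copy the inclusive page range, preserving page order.
            chapter.insert_pdf(src, from_page=first, to_page=last)
            target = out / chapter_filename(number, title)
            chapter.save(str(target))
            chapter.close()
            written.append(target)
    finally:
        src.close()
    return written
```

For example, `write_chapters("textbook.pdf", [(1, "Introduction", 4, 9)], "textbook_chapters")` would write pages 5 to 10 of the source file to `textbook_chapters/Chapter_01_Introduction.pdf`.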
your_textbook_chapters/
├── Chapter_01_Introduction.pdf
├── Chapter_02_Basic_Concepts.pdf
├── Chapter_03_Advanced_Topics.pdf
└── ...
your_textbook_chapters.zip
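A zip archive with this layout can be produced with the standard library alone. The sketch below assumes the chapter folder already exists and places the archive next to it; `zip_chapters` is a hypothetical name, not necessarily what main.py uses.

```python
import zipfile
from pathlib import Path


def zip_chapters(folder: str) -> Path:
    """Zip every PDF in `folder` into a sibling archive named <folder>.zip."""
    folder_path = Path(folder)
    archive = folder_path.parent / (folder_path.name + ".zip")
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # Sort so chapters appear in numeric order inside the archive.
        for pdf in sorted(folder_path.glob("*.pdf")):
            zf.write(pdf, arcname=f"{folder_path.name}/{pdf.name}")
    return archive
```

Keeping the folder name as a prefix inside the archive means that unzipping recreates the same `your_textbook_chapters/` directory rather than scattering files into the current folder.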
- Scans specified pages (default: page 4) for the table of contents
- Uses OCR to extract text from scanned PDFs
- Parses chapter titles and page numbers
- Reads actual page numbers from PDF footers
- Maps chapter page numbers to correct PDF indices
- Ensures accurate content extraction
- Creates individual PDF files for each chapter
- Maintains original page order and content
- Includes all sub-sections and summaries
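The steps above can be sketched in plain Python once the contents text and the footer numbers have been read. This is an illustration under stated assumptions, not the project's actual implementation: the contents-line format in `TOC_LINE` is a guess at a common layout, and all function names are hypothetical. The input to `printed_to_index` is assumed to be a dict of `{pdf_index: printed_page_or_None}` built from the footers.

```python
import re
from typing import Dict, List, Optional, Tuple

# Assumed contents-line format: optional "Chapter", a number, a title,
# optional dot leaders, then the printed page number.
TOC_LINE = re.compile(
    r"^\s*(?:chapter\s+)?(\d{1,2})[.:)]?\s+(.+?)[\s.]+(\d{1,4})\s*$",
    re.IGNORECASE,
)


def parse_toc(text: str) -> List[Tuple[int, str, int]]:
    """Return (chapter number, title, printed start page) per matching line."""
    entries = []
    for line in text.splitlines():
        match = TOC_LINE.match(line)
        if match:
            number, title, page = match.groups()
            entries.append((int(number), title.strip(), int(page)))
    return entries


def printed_to_index(footers: Dict[int, Optional[int]]) -> Dict[int, int]:
    """Invert {pdf_index: printed_page} into {printed_page: pdf_index}."""
    mapping: Dict[int, int] = {}
    for index in sorted(footers):
        printed = footers[index]
        # Skip unreadable footers; keep the first index for a repeated number.
        if printed is not None and printed not in mapping:
            mapping[printed] = index
    return mapping


def locate(printed_page: int, mapping: Dict[int, int]) -> Optional[int]:
    """Find the PDF index for a printed page, estimating if its footer is missing."""
    if printed_page in mapping:
        return mapping[printed_page]
    earlier = [p for p in mapping if p < printed_page]
    if not earlier:
        return None
    nearest = max(earlier)
    # Assume pages are consecutive after the nearest known footer.
    return mapping[nearest] + (printed_page - nearest)


def chapter_ranges(starts: List[int], page_count: int) -> List[Tuple[int, int]]:
    """Turn sorted 0-based start indices into inclusive (first, last) ranges."""
    ranges = []
    for i, first in enumerate(starts):
        last = starts[i + 1] - 1 if i + 1 < len(starts) else page_count - 1
        ranges.append((first, last))
    return ranges
```

Each chapter runs up to the page before the next chapter starts, which is how sub-sections and end-of-chapter summaries end up in the same file. The last chapter here simply runs to the end of the document, so any back matter (index, glossary) would be included with it.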
pdf-splitter/
├── main.py               # Main script
├── requirements.txt      # Python dependencies
├── README.md             # This file
├── LICENSE               # MIT License
├── .gitignore            # Git ignore file
├── CONTRIBUTING.md       # Contribution guidelines
├── CHANGELOG.md          # Version history
└── examples/             # Example PDFs and outputs
    ├── sample_input.pdf
    └── sample_output/
# Run the script on a test PDF
python main.py "examples/sample_input.pdf"
# Verify output
ls examples/sample_output/
# Check that chapter PDFs have correct page numbers
python -c "
import fitz
doc = fitz.open('output/Chapter_01_Example.pdf')
print(f'Pages: {doc.page_count}')
print(f'Footer page numbers: {[doc[i].get_text(\"text\", clip=fitz.Rect(0, doc[i].rect.height*0.9, doc[i].rect.width, doc[i].rect.height))[:10] for i in range(min(3, doc.page_count))]}')
"
1. Tesseract not found
# Ensure Tesseract is installed and in PATH
tesseract --version
2. OCR not working properly
- Check that the PDF has readable text or clear scanned images
- Ensure good image quality for scanned PDFs
3. Wrong chapter detection
- Verify the table of contents is on the expected page (default: page 4)
- Check that chapter titles follow the expected format
4. Missing chapters
- The script looks for the table of contents on page 4 by default
- If your PDF has its contents on different pages, modify the content_pages parameter
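For the "Tesseract not found" case, the same PATH check can be done from Python before any OCR is attempted. This is a small sketch using only the standard library; `tesseract_version` is a hypothetical helper, not part of the script.

```python
import shutil
import subprocess
from typing import Optional


def tesseract_version() -> Optional[str]:
    """Return the first line of `tesseract --version`, or None if not on PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    result = subprocess.run([exe, "--version"], capture_output=True, text=True)
    # Some Tesseract builds print the version banner to stderr.
    output = result.stdout or result.stderr
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    version = tesseract_version()
    print(version or "Tesseract not found in PATH")
```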
We welcome contributions! Please see CONTRIBUTING.md for details.
# Fork and clone the repository
git clone https://github.com/[YOUR_USERNAME]/pdf-splitter.git
cd pdf-splitter
# Create virtual environment
python -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Install development dependencies
pip install pytest black flake8
# Run tests
pytest tests/
# Format code
black .
# Lint code
flake8 .
This project is licensed under the MIT License - see the LICENSE file for details.
- PyMuPDF for PDF processing
- Tesseract OCR for text extraction
- Pillow for image processing
- Supported Formats: PDF with text or scanned images
- Chapter Detection: Automatic from table of contents
- Accuracy: Footer-verified page extraction
- Output: Individual chapter PDFs + zip archive
See CHANGELOG.md for a complete version history.
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: [sajanjohn at the rate google mail]
Made with ❤️ for the open source community