PDF Chapter Splitter

A Python tool that splits PDF textbooks into individual chapter files based on the table of contents. It is aimed at students, researchers, and anyone who needs to work with specific chapters of a large PDF document.

🚀 Features

Smart OCR Integration: Automatically detects and extracts text from scanned PDFs using OCR
Footer Page Verification: Uses PDF footer page numbers to ensure accurate chapter extraction
Multi-format Support: Handles various table of contents formats and layouts
Batch Processing: Processes entire PDFs and creates individual chapter files
Zip Archive Creation: Automatically creates a zip file containing all chapters
Chapter Numbering: Maintains original chapter numbers in filenames
Complete Content: Extracts full chapter content including sub-sections and summaries

📋 Requirements

Python 3.8 or higher
Tesseract OCR engine
Virtual environment (recommended)

🛠️ Installation

1. Clone the Repository

git clone https://github.com/[YOUR_USERNAME]/pdf-splitter.git
cd pdf-splitter

2. Install Tesseract OCR

On macOS:

brew install tesseract

On Ubuntu/Debian:

sudo apt-get install tesseract-ocr

On Windows: download and install Tesseract from the Tesseract GitHub page, then make sure the install directory is on your PATH.

3. Set Up Python Environment

# Create virtual environment
python -m venv venv

# Activate virtual environment
# On macOS/Linux:
source venv/bin/activate
# On Windows:
venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

📖 Usage

Basic Usage

python main.py "path/to/your/textbook.pdf"

Example

python main.py "Cambridge IGCSE Biology Workbook.pdf"

Output

The script will:

Analyze the PDF and extract chapter information from the table of contents
Create a folder named {pdf_name}_chapters
Generate individual PDF files for each chapter
Create a zip archive containing all chapters

Output Structure

your_textbook_chapters/
├── Chapter_01_Introduction.pdf
├── Chapter_02_Basic_Concepts.pdf
├── Chapter_03_Advanced_Topics.pdf
└── ...
your_textbook_chapters.zip
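
The naming and archiving shown above can be sketched roughly as follows. This is an illustration only, not the code in main.py: the helper names chapter_filename and zip_chapters are hypothetical, and the rule for cleaning up titles is an assumption.

```python
import re
import shutil
from pathlib import Path


def chapter_filename(number: int, title: str) -> str:
    """Build a name like 'Chapter_01_Introduction.pdf' (hypothetical helper)."""
    # Assumed rule: every run of characters other than letters, digits,
    # or hyphens becomes a single underscore.
    safe_title = re.sub(r"[^A-Za-z0-9-]+", "_", title).strip("_")
    return f"Chapter_{number:02d}_{safe_title}.pdf"


def zip_chapters(chapter_dir: Path) -> Path:
    """Create '<chapter_dir>.zip' next to the folder and return its path."""
    # make_archive appends the '.zip' extension to the base name itself.
    archive = shutil.make_archive(str(chapter_dir), "zip", root_dir=chapter_dir)
    return Path(archive)
```

For example, chapter_filename(2, "Basic Concepts") gives Chapter_02_Basic_Concepts.pdf, matching the layout above.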

🔧 How It Works

1. Content Detection

Scans specified pages (default: page 4) for the table of contents
Uses OCR to extract text from scanned PDFs
Parses chapter titles and page numbers
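
As a rough illustration of the parsing step, the sketch below pulls chapter number, title, and page number out of table-of-contents text. The line shape it expects (an optional "Chapter" prefix, a number, a title, optional dot leaders, then a page number) is an assumption; the actual patterns used by the script may differ, and TocEntry and parse_toc are hypothetical names.

```python
import re
from typing import List, NamedTuple


class TocEntry(NamedTuple):
    number: int
    title: str
    page: int


# Assumed line shape: "[Chapter] <number> <title> <dots or spaces> <page>"
TOC_LINE = re.compile(
    r"^\s*(?:chapter\s+)?(\d+)[.:)]?\s+(.+?)[\s.]+(\d+)\s*$",
    re.IGNORECASE,
)


def parse_toc(text: str) -> List[TocEntry]:
    """Return one entry per line that looks like a chapter listing."""
    entries = []
    for line in text.splitlines():
        match = TOC_LINE.match(line)
        if match:
            number, title, page = match.groups()
            entries.append(TocEntry(int(number), title.strip(), int(page)))
    return entries
```

Lines that do not fit the pattern, such as a "Contents" heading, are skipped rather than treated as errors, which keeps the parser tolerant of OCR noise.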

2. Page Index Mapping

Reads actual page numbers from PDF footers
Maps chapter page numbers to correct PDF indices
Ensures accurate content extraction
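
One way to think about the mapping: printed page numbers usually differ from 0-based PDF indices by a constant offset, because front matter is unnumbered or numbered separately. The sketch below estimates that offset from footer text by majority vote. The voting approach and all function names are assumptions made for illustration, not a description of the script's internals.

```python
import re
from collections import Counter
from typing import Dict, Optional


def footer_page_number(footer_text: str) -> Optional[int]:
    """Return the last standalone integer in a footer, or None if there is none."""
    numbers = re.findall(r"\b\d{1,4}\b", footer_text)
    return int(numbers[-1]) if numbers else None


def estimate_offset(footers: Dict[int, str]) -> Optional[int]:
    """Most common (pdf_index - printed_page) difference across sampled pages."""
    diffs: Counter = Counter()
    for index, text in footers.items():
        printed = footer_page_number(text)
        if printed is not None:
            diffs[index - printed] += 1
    return diffs.most_common(1)[0][0] if diffs else None


def printed_to_index(printed_page: int, offset: int) -> int:
    """Convert a printed page number to a 0-based PDF page index."""
    return printed_page + offset
```

Voting across several pages means a single misread footer does not shift every chapter boundary.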

3. Chapter Extraction

Creates individual PDF files for each chapter
Maintains original page order and content
Includes all sub-sections and summaries
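
The extraction step can be pictured as two parts: working out which pages belong to each chapter, then copying that range into a new file. The sketch below assumes each chapter runs until the page before the next chapter starts; chapter_ranges and write_chapter are hypothetical names, and write_chapter uses PyMuPDF's insert_pdf.

```python
from typing import List, Tuple


def chapter_ranges(start_indices: List[int], page_count: int) -> List[Tuple[int, int]]:
    """Turn sorted 0-based chapter start indices into inclusive (first, last) ranges."""
    ranges = []
    for i, first in enumerate(start_indices):
        # A chapter ends on the page before the next chapter starts;
        # the final chapter runs to the end of the document.
        if i + 1 < len(start_indices):
            last = start_indices[i + 1] - 1
        else:
            last = page_count - 1
        ranges.append((first, last))
    return ranges


def write_chapter(source_path: str, first: int, last: int, out_path: str) -> None:
    """Copy pages first..last (inclusive, 0-based) into a new PDF."""
    import fitz  # PyMuPDF; imported here so chapter_ranges works without it

    source = fitz.open(source_path)
    chapter = fitz.open()  # new, empty PDF
    chapter.insert_pdf(source, from_page=first, to_page=last)
    chapter.save(out_path)
    chapter.close()
    source.close()
```

Under this assumption the last chapter also picks up any back matter (glossary, index) that follows it, so a real implementation may want an explicit end page for the final chapter.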

📁 Project Structure

pdf-splitter/
├── main.py                 # Main script
├── requirements.txt        # Python dependencies
├── README.md              # This file
├── LICENSE                # MIT License
├── .gitignore            # Git ignore file
├── CONTRIBUTING.md       # Contribution guidelines
├── CHANGELOG.md          # Version history
└── examples/             # Example PDFs and outputs
    ├── sample_input.pdf
    └── sample_output/

🧪 Testing

Test with Sample PDF

# Run the script on a test PDF
python main.py "examples/sample_input.pdf"

# Verify output
ls examples/sample_output/

Verify Chapter Content

# Check that chapter PDFs have correct page numbers
python -c "
import fitz
doc = fitz.open('output/Chapter_01_Example.pdf')
print(f'Pages: {doc.page_count}')
print(f'Footer page numbers: {[doc[i].get_text(\"text\", clip=fitz.Rect(0, doc[i].rect.height*0.9, doc[i].rect.width, doc[i].rect.height))[:10] for i in range(min(3, doc.page_count))]}')
"

🔍 Troubleshooting

Common Issues

1. Tesseract not found

# Ensure Tesseract is installed and in PATH
tesseract --version
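
The same check can be run from Python, which is useful when the shell and the virtual environment see different PATH values. This is a small sketch; tesseract_version is a hypothetical helper name and not part of the project.

```python
import shutil
import subprocess
from typing import Optional


def tesseract_version() -> Optional[str]:
    """Return the first line of `tesseract --version`, or None if it is not on PATH."""
    executable = shutil.which("tesseract")
    if executable is None:
        return None
    result = subprocess.run([executable, "--version"], capture_output=True, text=True)
    # Some Tesseract builds print the version banner to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    print(tesseract_version() or "Tesseract was not found on PATH")
```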

2. OCR not working properly

Check that the PDF has readable text or clear scanned images
Ensure good image quality for scanned PDFs

3. Wrong chapter detection

Verify the table of contents is on the expected page (default: page 4)
Check that chapter titles follow the expected format

4. Missing chapters

The script looks for the table of contents on page 4 by default
If your PDF's table of contents is on a different page, or spans several pages, modify the content_pages parameter
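
If you are not sure which page to use, a quick heuristic can suggest candidates. The sketch below takes the text of each page (for example from PyMuPDF's page.get_text(), or from OCR for scanned PDFs) and flags pages that have a "Contents" heading and several lines ending in a number. Both the heuristic and the likely_contents_pages name are assumptions made for illustration; the script itself does not necessarily work this way.

```python
import re
from typing import List

PAGE_REF = re.compile(r"\d+\s*$")  # a line that ends in a number, like a TOC entry


def likely_contents_pages(page_texts: List[str], min_entries: int = 5) -> List[int]:
    """Return 1-based page numbers whose text looks like a table of contents."""
    candidates = []
    for page_number, text in enumerate(page_texts, start=1):
        lines = [line for line in text.splitlines() if line.strip()]
        entries = sum(1 for line in lines if PAGE_REF.search(line))
        # Only look for the heading near the top of the page.
        has_heading = any("contents" in line.lower() for line in lines[:5])
        if has_heading and entries >= min_entries:
            candidates.append(page_number)
    return candidates
```

The page numbers it returns are 1-based, matching the "page 4" convention used in this README.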

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for details.

Development Setup

# Fork and clone the repository
git clone https://github.com/[YOUR_USERNAME]/pdf-splitter.git
cd pdf-splitter

# Create virtual environment
python -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Install development dependencies
pip install pytest black flake8

Running Tests

# Run tests
pytest tests/

# Format code
black .

# Lint code
flake8 .

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

PyMuPDF for PDF processing
Tesseract OCR for text extraction
Pillow for image processing

📊 Statistics

Supported Formats: PDF with text or scanned images
Chapter Detection: Automatic from table of contents
Accuracy: Footer-verified page extraction
Output: Individual chapter PDFs + zip archive

🔄 Version History

See CHANGELOG.md for a complete version history.

📞 Support

Issues: GitHub Issues
Discussions: GitHub Discussions
Email: [sajanjohn at the rate google mail]

Made with ❤️ for the open source community
