EpubSage

EpubSage is a Python library and CLI tool for extracting structured content, metadata, and images from EPUB files. It handles the differences between publisher formats and exposes a single, unified API.

Why EpubSage?

EPUB files vary significantly between publishers. Headings may be nested in <span> tags, chapters may be split across several files, and metadata formats differ from one publisher to the next. EpubSage abstracts this complexity:

from epub_sage import process_epub

result = process_epub("book.epub")
print(f"Title: {result.title}")
print(f"Chapters: {result.total_chapters}")

That's it. One function call to extract everything.

Features

Feature | Description
Publisher-Agnostic | Works with O'Reilly, Packt, Manning, and more
Complete Extraction | Chapters, metadata, images, word counts, reading time
TOC-Based Extraction | Precise section splitting using TOC anchor boundaries
Smart Image Handling | Discovers and validates all referenced images
Content Classification | Identifies front matter, chapters, back matter, parts
Dublin Core Metadata | Full standards-compliant metadata extraction
TOC Parsing | Supports NCX (EPUB 2) and NAV (EPUB 3)
Full-Text Search | Search across all book content
CLI Tool | 13 commands for complete EPUB analysis

Requirements

  • Python 3.10+
  • Dependencies: beautifulsoup4, lxml, pydantic, typer, rich

Installation

pip install epubsage

Or with uv:

uv add epubsage

Quick Start

Python

from epub_sage import process_epub

result = process_epub("book.epub")

print(f"Title: {result.title}")
print(f"Author: {result.author}")
print(f"Words: {result.total_words:,}")
print(f"Reading time: {result.estimated_reading_time}")

for chapter in result.chapters[:3]:
    print(f"  {chapter['chapter_id']}: {chapter['title']}")

Command Line

epub-sage info book.epub

Command Line Interface

EpubSage includes 13 commands for complete EPUB analysis.

epub-sage --help

Command | Description
info | Quick book summary
stats | Detailed statistics
chapters | List chapters with word counts
metadata | Dublin Core metadata
toc | Table of contents
images | Image distribution
search | Full-text search
validate | Validate EPUB structure
spine | Reading order
manifest | All EPUB resources
extract | Export to JSON
list | Raw EPUB contents
cover | Extract cover image

View full CLI documentation →

Key Commands

chapters

epub-sage chapters book.epub

search

epub-sage search book.epub "machine learning"

extract

epub-sage extract book.epub -o output.json

Python Library

Basic Processing

from epub_sage import process_epub

result = process_epub("book.epub")

if result.success:
    print(f"Title: {result.title}")
    print(f"Author: {result.author}")
    print(f"Chapters: {result.total_chapters}")
else:
    print(f"Errors: {result.errors}")

Iterate Chapters

for chapter in result.chapters:
    print(f"{chapter['chapter_id']}: {chapter['title']}")
    print(f"  Words: {chapter['word_count']}")
    print(f"  Images: {len(chapter['images'])}")
    print(f"  Type: {chapter['content_type']}")

Access Metadata

metadata = result.full_metadata

print(f"Title: {metadata.title}")
print(f"Publisher: {metadata.publisher}")
print(f"ISBN: {metadata.get_isbn()}")
print(f"Publication Date: {metadata.get_publication_date()}")

Extract Images

for chapter in result.chapters:
    if chapter['images']:
        print(f"Chapter: {chapter['title']}")
        for img in chapter['images']:
            print(f"  - {img}")
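
As a follow-up to the loop above, the sketch below tallies images by file extension across chapters. It is not part of EpubSage: image_type_counts is a hypothetical helper, and it assumes only the documented shape of a chapter dictionary, where 'images' is a list of path strings. The sample data stands in for result.chapters.

```python
from collections import Counter
from pathlib import PurePosixPath


def image_type_counts(chapters):
    """Count image paths by lowercase file extension (hypothetical helper)."""
    counts = Counter()
    for chapter in chapters:
        for img in chapter.get("images", []):
            # EPUB internal paths use forward slashes, so PurePosixPath fits.
            suffix = PurePosixPath(img).suffix.lower() or "(none)"
            counts[suffix] += 1
    return counts


# Sample data shaped like the documented chapter dictionaries (made up).
sample_chapters = [
    {"title": "Intro", "images": ["images/fig1.png", "images/fig2.JPG"]},
    {"title": "Basics", "images": []},
    {"title": "Cover", "images": ["images/cover.png"]},
]

counts = image_type_counts(sample_chapters)
for suffix, n in sorted(counts.items()):
    print(f"{suffix}: {n}")
```

With a real book, pass result.chapters in place of sample_chapters.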

Content Blocks

chapter = result.chapters[0]

for block in chapter['content']:
    print(f"[{block['tag']}] {block['text'][:100]}...")
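
Content blocks can also be flattened into plain text. The sketch below is a hypothetical helper, not an EpubSage function. It relies on the 'tag' and 'text' keys shown above and additionally assumes that 'tag' holds HTML element names such as "h1" or "p", which this README does not confirm.

```python
HEADING_TAGS = ("h1", "h2", "h3", "h4", "h5", "h6")


def blocks_to_text(blocks, heading_tags=HEADING_TAGS):
    """Join content blocks into plain text, prefixing heading blocks with '# '."""
    lines = []
    for block in blocks:
        text = block.get("text", "").strip()
        if not text:
            continue  # skip blocks with no visible text
        # Assumption: 'tag' contains an HTML element name.
        prefix = "# " if block.get("tag") in heading_tags else ""
        lines.append(prefix + text)
    return "\n\n".join(lines)


# Sample data shaped like chapter['content'] (made up for illustration).
sample_blocks = [
    {"tag": "h1", "text": "Introduction"},
    {"tag": "p", "text": "EPUB files are ZIP archives."},
    {"tag": "p", "text": "   "},
]

print(blocks_to_text(sample_blocks))
```

With a real book, call blocks_to_text(chapter['content']) for each chapter.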

View full API documentation →

View real-world examples →

Output Format

SimpleEpubResult

Field | Type | Description
title | str | Book title
author | str | Primary author
publisher | str | Publisher name
chapters | list[dict] | Chapter data
total_chapters | int | Chapter count
total_words | int | Word count
estimated_reading_time | dict | {'hours': N, 'minutes': N}
success | bool | Processing status
full_metadata | DublinCoreMetadata | Complete metadata
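
Because estimated_reading_time is a dictionary rather than a string, printing it directly shows the raw dict. A minimal formatting sketch follows; format_reading_time is a hypothetical helper, not an EpubSage function, and it assumes only the 'hours' and 'minutes' keys listed above.

```python
def format_reading_time(reading_time):
    """Render {'hours': N, 'minutes': N} as a short human-readable string."""
    hours = int(reading_time.get("hours", 0))
    minutes = int(reading_time.get("minutes", 0))
    if hours:
        return f"{hours} h {minutes:02d} min"
    return f"{minutes} min"


print(format_reading_time({"hours": 2, "minutes": 5}))
print(format_reading_time({"hours": 0, "minutes": 45}))
```

With a real book, pass result.estimated_reading_time to the helper.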

Chapter Dictionary

Field | Type | Description
chapter_id | int | Sequential ID
title | str | Chapter title
word_count | int | Words in chapter
images | list[str] | Image paths
content | list[dict] | Content blocks
sections | list[dict] | TOC-based sections with nested subsections
content_type | str | chapter, front_matter, back_matter, part
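
The content_type field makes it possible to separate body chapters from front and back matter. The sketch below totals word counts per content type. It is a hypothetical helper built only on the 'content_type' and 'word_count' keys listed above, and the sample data is made up.

```python
from collections import defaultdict


def words_by_content_type(chapters):
    """Sum word_count per content_type over a list of chapter dictionaries."""
    totals = defaultdict(int)
    for chapter in chapters:
        totals[chapter["content_type"]] += chapter["word_count"]
    return dict(totals)


# Sample data shaped like the documented chapter dictionaries (made up).
sample_chapters = [
    {"chapter_id": 0, "content_type": "front_matter", "word_count": 300},
    {"chapter_id": 1, "content_type": "chapter", "word_count": 4200},
    {"chapter_id": 2, "content_type": "chapter", "word_count": 3800},
    {"chapter_id": 3, "content_type": "back_matter", "word_count": 500},
]

totals = words_by_content_type(sample_chapters)
for content_type, words in totals.items():
    print(f"{content_type}: {words:,} words")
```

With a real book, pass result.chapters in place of sample_chapters.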

View complete data models →

Architecture

epub_sage/
├── core/           # Parsers (Dublin Core, Structure, TOC)
├── extractors/     # EPUB handling, content extraction
├── processors/     # Processing pipelines
├── models/         # Pydantic data models
├── services/       # Search, export services
└── cli.py          # Command-line interface

Processing Pipeline:

  1. EpubExtractor → Unzips EPUB
  2. DublinCoreParser → Extracts metadata
  3. EpubStructureParser → Analyzes structure
  4. ContentExtractor → Extracts text & images
  5. SimpleEpubProcessor → Orchestrates all steps
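
To make the first two steps concrete, here is a standard-library sketch of what unzipping an EPUB and reading its Dublin Core title involves. It does not use or reproduce EpubSage's classes. The function names are hypothetical, and the in-memory EPUB is a deliberately incomplete sample that carries just enough structure for the lookup to work.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespaces defined by the EPUB container format and Dublin Core.
NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "dc": "http://purl.org/dc/elements/1.1/",
}

CONTAINER_XML = """<container version="1.0"
    xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf"
              media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>"""

OPF_XML = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Sample Book</dc:title>
  </metadata>
</package>"""


def build_minimal_epub():
    """Create a tiny in-memory EPUB-like archive for demonstration only."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", "application/epub+zip")
        zf.writestr("META-INF/container.xml", CONTAINER_XML)
        zf.writestr("OEBPS/content.opf", OPF_XML)
    buf.seek(0)
    return buf


def read_dc_title(epub_file):
    """Return the dc:title of an EPUB given a path or file-like object."""
    with zipfile.ZipFile(epub_file) as zf:
        # Step 1: container.xml points at the package (OPF) document.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", NS)
        opf_path = rootfile.attrib["full-path"]
        # Step 2: the OPF metadata section holds the Dublin Core elements.
        package = ET.fromstring(zf.read(opf_path))
        title = package.find(".//dc:title", NS)
        return title.text if title is not None else None


print(read_dc_title(build_minimal_epub()))
```

Real EPUBs add a manifest, a spine, and usually several metadata variants on top of this, which is the part the pipeline above is meant to handle.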

Development

Setup

git clone https://github.com/Abdullah-Wex/epubsage.git
cd epubsage
uv venv && source .venv/bin/activate
uv pip install -e ".[dev]"

Commands

make test      # Run 60+ tests
make format    # Format code
make lint      # Check quality

Running Tests

PYTHONPATH="$PWD" .venv/bin/python -m pytest tests/ -v

Documentation

Document | Description
CLI Reference | Complete CLI documentation
API Reference | Python API documentation
Examples | Real-world use cases

License

MIT License. See LICENSE for details.

Contributing

Contributions welcome! See CONTRIBUTING.md for guidelines.


EpubSage — Extract. Analyze. Build.
