EpubSage

EpubSage is a Python library and CLI tool for extracting structured content, metadata, and images from EPUB files. It absorbs the differences between publisher formats and exposes a single, unified API.

Why EpubSage?

EPUB files vary significantly between publishers. Headings may be nested in <span> tags, chapters may be split across several files, and metadata formats differ from publisher to publisher. EpubSage abstracts this complexity:

from epub_sage import process_epub

result = process_epub("book.epub")
print(f"Title: {result.title}")
print(f"Chapters: {result.total_chapters}")

That's it. One function call to extract everything.
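
To make the `<span>` problem concrete, here is a small standard-library sketch that reads heading text regardless of inline wrappers. It is independent of EpubSage and does not claim to mirror its parser; `HeadingTextParser` and `extract_headings` are names invented for this illustration.

```python
from html.parser import HTMLParser


class HeadingTextParser(HTMLParser):
    """Collect the text of h1-h6 elements, ignoring inline wrappers such as <span>."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.headings = []   # finished heading strings
        self._parts = None   # text fragments of the heading being read, or None

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self._parts = []

    def handle_data(self, data):
        # Only keep text while we are inside a heading element.
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADING_TAGS and self._parts is not None:
            # Collapse the whitespace left behind by nested tags.
            self.headings.append(" ".join("".join(self._parts).split()))
            self._parts = None


def extract_headings(xhtml: str) -> list[str]:
    """Return the plain text of every heading found in an XHTML fragment."""
    parser = HeadingTextParser()
    parser.feed(xhtml)
    parser.close()
    return parser.headings
```

A heading such as `<h1><span>Chapter 1.</span> <span>Getting Started</span></h1>` comes back as one string, which is the kind of normalization a publisher-agnostic extractor has to perform.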

Features

| Feature | Description |
| --- | --- |
| Publisher-Agnostic | Works with O'Reilly, Packt, Manning, and more |
| Complete Extraction | Chapters, metadata, images, word counts, reading time |
| TOC-Based Extraction | Precise section splitting using TOC anchor boundaries |
| Smart Image Handling | Discovers and validates all referenced images |
| Content Classification | Identifies front matter, chapters, back matter, parts |
| Dublin Core Metadata | Full standards-compliant metadata extraction |
| TOC Parsing | Supports NCX (EPUB 2) and NAV (EPUB 3) |
| Full-Text Search | Search across all book content |
| CLI Tool | 13 commands for complete EPUB analysis |
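
As background for the TOC-related features, the sketch below shows roughly what reading an EPUB 2 NCX file involves, using only the standard library. It is an illustration of the file format, not EpubSage's implementation; `parse_ncx` and the keys of the returned dictionaries are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the NCX specification used in EPUB 2.
NCX_NS = {"ncx": "http://www.daisy.org/z3986/2005/ncx/"}


def parse_ncx(ncx_xml: str) -> list[dict]:
    """Flatten the <navMap> of an NCX document into a list of TOC entries."""
    root = ET.fromstring(ncx_xml)
    nav_map = root.find("ncx:navMap", NCX_NS)
    if nav_map is None:
        return []

    def walk(parent, depth):
        entries = []
        for point in parent.findall("ncx:navPoint", NCX_NS):
            label = point.findtext("ncx:navLabel/ncx:text", default="", namespaces=NCX_NS)
            content = point.find("ncx:content", NCX_NS)
            src = content.get("src", "") if content is not None else ""
            # "ch1.xhtml#setup" -> file "ch1.xhtml", anchor "setup"
            file_name, _, anchor = src.partition("#")
            entries.append({
                "title": label.strip(),
                "file": file_name,
                "anchor": anchor or None,
                "depth": depth,
            })
            # Nested navPoints become deeper entries directly after their parent.
            entries.extend(walk(point, depth + 1))
        return entries

    return walk(nav_map, 0)
```

The `anchor` values are what make anchor-based section splitting possible: two consecutive entries that point into the same file mark the start and end of a section.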

Requirements

Python 3.10+
Dependencies: beautifulsoup4, lxml, pydantic, typer, rich

Installation

pip install epubsage

Or with uv:

uv add epubsage

Quick Start

Python

from epub_sage import process_epub

result = process_epub("book.epub")

print(f"Title: {result.title}")
print(f"Author: {result.author}")
print(f"Words: {result.total_words:,}")
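reading_time = result.estimated_reading_time  # dict: {'hours': N, 'minutes': N}
print(f"Reading time: {reading_time['hours']}h {reading_time['minutes']}m")

# Preview the first three chapters
for chapter in result.chapters[:3]:
    print(f"  {chapter['chapter_id']}: {chapter['title']}")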

Command Line

epub-sage info book.epub

Command Line Interface

EpubSage includes 13 commands for complete EPUB analysis.

epub-sage --help

| Command | Description |
| --- | --- |
| `info` | Quick book summary |
| `stats` | Detailed statistics |
| `chapters` | List chapters with word counts |
| `metadata` | Dublin Core metadata |
| `toc` | Table of contents |
| `images` | Image distribution |
| `search` | Full-text search |
| `validate` | Validate EPUB structure |
| `spine` | Reading order |
| `manifest` | All EPUB resources |
| `extract` | Export to JSON |
| `list` | Raw EPUB contents |
| `cover` | Extract cover image |

View full CLI documentation →

Key Commands

chapters

epub-sage chapters book.epub

search

epub-sage search book.epub "machine learning"

extract

epub-sage extract book.epub -o output.json
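
If you want to drive the `extract` command from a script, one possible approach is to shell out and load the resulting file. This sketch assumes `epub-sage` is on your `PATH`; the helper names are hypothetical, and because the layout of the exported JSON is not described here, the parsed object is returned as-is.

```python
import json
import subprocess
from pathlib import Path


def build_extract_command(epub_path: str, output_path: str) -> list[str]:
    """Return the argv for `epub-sage extract <epub> -o <output>`."""
    return ["epub-sage", "extract", str(epub_path), "-o", str(output_path)]


def extract_to_json(epub_path: str, output_path: str):
    """Run the CLI export and return the parsed JSON document."""
    # check=True raises CalledProcessError if the CLI exits with a non-zero status.
    subprocess.run(build_extract_command(epub_path, output_path), check=True)
    return json.loads(Path(output_path).read_text(encoding="utf-8"))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.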

Python Library

Basic Processing

from epub_sage import process_epub

result = process_epub("book.epub")

if result.success:
    print(f"Title: {result.title}")
    print(f"Author: {result.author}")
    print(f"Chapters: {result.total_chapters}")
else:
    print(f"Errors: {result.errors}")

Iterate Chapters

for chapter in result.chapters:
    print(f"{chapter['chapter_id']}: {chapter['title']}")
    print(f"  Words: {chapter['word_count']}")
    print(f"  Images: {len(chapter['images'])}")
    print(f"  Type: {chapter['content_type']}")
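
Because chapters are plain dictionaries, ordinary Python is enough to summarize them. The sketch below uses hand-written sample data with the `content_type` and `word_count` keys shown above; the helper names and the sample values are illustrative only.

```python
from collections import defaultdict

# Illustrative data only; real chapter dictionaries carry more fields.
sample_chapters = [
    {"chapter_id": 0, "title": "Preface", "word_count": 800, "content_type": "front_matter"},
    {"chapter_id": 1, "title": "Getting Started", "word_count": 4200, "content_type": "chapter"},
    {"chapter_id": 2, "title": "Going Further", "word_count": 5100, "content_type": "chapter"},
    {"chapter_id": 3, "title": "Index", "word_count": 1500, "content_type": "back_matter"},
]


def words_by_content_type(chapters):
    """Sum word counts per content type (chapter, front_matter, ...)."""
    totals = defaultdict(int)
    for chapter in chapters:
        totals[chapter["content_type"]] += chapter["word_count"]
    return dict(totals)


def main_chapters(chapters):
    """Keep only entries classified as regular chapters."""
    return [c for c in chapters if c["content_type"] == "chapter"]
```

With a real book you would pass `result.chapters` instead of `sample_chapters`.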

Access Metadata

metadata = result.full_metadata

print(f"Title: {metadata.title}")
print(f"Publisher: {metadata.publisher}")
print(f"ISBN: {metadata.get_isbn()}")
print(f"Publication Date: {metadata.get_publication_date()}")

Extract Images

for chapter in result.chapters:
    if chapter['images']:
        print(f"Chapter: {chapter['title']}")
        for img in chapter['images']:
            print(f"  - {img}")
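
The same image can be referenced from several chapters, so a de-duplicated list is often handy. This is a minimal sketch that assumes each entry in `images` is a path string, as in the loop above; the function names are made up for the example.

```python
from collections import Counter
from pathlib import PurePosixPath


def unique_images(chapters):
    """Return every image path once, in first-seen order."""
    seen = set()
    ordered = []
    for chapter in chapters:
        for path in chapter.get("images", []):
            if path not in seen:
                seen.add(path)
                ordered.append(path)
    return ordered


def count_by_extension(paths):
    """Count image paths per lower-cased file extension."""
    return Counter(PurePosixPath(p).suffix.lower() for p in paths)
```

`PurePosixPath` is used because paths inside an EPUB archive use forward slashes regardless of the host operating system.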

Content Blocks

chapter = result.chapters[0]

for block in chapter['content']:
    print(f"[{block['tag']}] {block['text'][:100]}...")
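
Content blocks also make simple text queries straightforward. The following is a naive case-insensitive search over the `tag` and `text` keys shown above. It is only a sketch and is not how the built-in search is implemented; `search_blocks` and its `context` parameter are hypothetical.

```python
import re


def search_blocks(chapters, query, context=30):
    """Return one hit per content block whose text contains `query`."""
    # re.escape treats the query as literal text, not as a regular expression.
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    hits = []
    for chapter in chapters:
        for block in chapter.get("content", []):
            text = block["text"]
            match = pattern.search(text)
            if match is None:
                continue
            # Keep a little surrounding text so the hit is readable on its own.
            start = max(0, match.start() - context)
            end = min(len(text), match.end() + context)
            hits.append({
                "chapter_id": chapter["chapter_id"],
                "tag": block["tag"],
                "snippet": text[start:end],
            })
    return hits
```

Calling `search_blocks(result.chapters, "machine learning")` would then list the matching blocks together with the chapter they belong to.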

View full API documentation →

View real-world examples →

Output Format

SimpleEpubResult

| Field | Type | Description |
| --- | --- | --- |
| `title` | `str` | Book title |
| `author` | `str` | Primary author |
| `publisher` | `str` | Publisher name |
| `chapters` | `list[dict]` | Chapter data |
| `total_chapters` | `int` | Chapter count |
| `total_words` | `int` | Word count |
| `estimated_reading_time` | `dict` | `{'hours': N, 'minutes': N}` |
| `success` | `bool` | Processing status |
| `full_metadata` | `DublinCoreMetadata` | Complete metadata |
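
Since `estimated_reading_time` is a dictionary rather than a string, you may want a small formatter before showing it to users. The helper below is a hypothetical convenience function, not part of EpubSage; it relies only on the `hours` and `minutes` keys listed above.

```python
def format_reading_time(reading_time: dict) -> str:
    """Turn {'hours': N, 'minutes': N} into a short human-readable string."""
    hours = reading_time.get("hours", 0)
    minutes = reading_time.get("minutes", 0)
    parts = []
    if hours:
        parts.append(f"{hours} h")
    # Always show minutes when there are no hours, so the result is never empty.
    if minutes or not parts:
        parts.append(f"{minutes} min")
    return " ".join(parts)
```

For example, `format_reading_time(result.estimated_reading_time)` can replace printing the raw dictionary.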

Chapter Dictionary

| Field | Type | Description |
| --- | --- | --- |
| `chapter_id` | `int` | Sequential ID |
| `title` | `str` | Chapter title |
| `word_count` | `int` | Words in chapter |
| `images` | `list[str]` | Image paths |
| `content` | `list[dict]` | Content blocks |
| `sections` | `list[dict]` | TOC-based sections with nested `subsections` |
| `content_type` | `str` | `chapter`, `front_matter`, `back_matter`, `part` |
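
The nested `sections` list can be walked recursively to print an outline. This sketch assumes each section dictionary has a `title` key next to `subsections`; that key name is a guess made for illustration, so check the data models for the actual fields.

```python
def flatten_sections(sections, depth=0):
    """Yield (depth, title) pairs for sections and all nested subsections."""
    for section in sections:
        # "title" is an assumed key; fall back to a placeholder if it is absent.
        yield depth, section.get("title", "(untitled)")
        # Treat a missing or None "subsections" value as an empty list.
        yield from flatten_sections(section.get("subsections") or [], depth + 1)


def print_outline(sections):
    """Print an indented outline, two spaces per nesting level."""
    for depth, title in flatten_sections(sections):
        print("  " * depth + title)
```

A call such as `print_outline(chapter['sections'])` would then show the structure of a single chapter.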

View complete data models →

Architecture

epub_sage/
├── core/           # Parsers (Dublin Core, Structure, TOC)
├── extractors/     # EPUB handling, content extraction
├── processors/     # Processing pipelines
├── models/         # Pydantic data models
├── services/       # Search, export services
└── cli.py          # Command-line interface

Processing Pipeline:

EpubExtractor → Unzips EPUB
DublinCoreParser → Extracts metadata
EpubStructureParser → Analyzes structure
ContentExtractor → Extracts text & images
SimpleEpubProcessor → Orchestrates all steps
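
To give a feel for what the first steps of such a pipeline involve, here is a standard-library sketch that opens an EPUB archive, locates the package document through `META-INF/container.xml`, and reads a few Dublin Core fields plus the spine. It does not use or reproduce the classes listed above; every function name here is invented for the example, and `make_sample_epub` builds a toy archive purely so the code can be tried without a real book.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def find_opf_path(archive: zipfile.ZipFile) -> str:
    """Read META-INF/container.xml and return the path of the package document."""
    container = ET.fromstring(archive.read("META-INF/container.xml"))
    rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    if rootfile is None:
        raise ValueError("container.xml has no <rootfile> element")
    return rootfile.attrib["full-path"]


def read_basic_metadata(epub) -> dict:
    """Return title, creator, publisher and spine order from an EPUB path or file object."""
    with zipfile.ZipFile(epub) as archive:
        opf = ET.fromstring(archive.read(find_opf_path(archive)))
    metadata = opf.find("opf:metadata", OPF_NS)

    def first(tag):
        # First Dublin Core element with this name, or None if it is missing.
        if metadata is None:
            return None
        return metadata.findtext(f"dc:{tag}", default=None, namespaces=OPF_NS)

    spine = [item.get("idref") for item in opf.findall("opf:spine/opf:itemref", OPF_NS)]
    return {
        "title": first("title"),
        "creator": first("creator"),
        "publisher": first("publisher"),
        "spine": spine,
    }


def make_sample_epub() -> bytes:
    """Build a tiny, content-free EPUB skeleton in memory for experimentation."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/epub+zip")
        archive.writestr(
            "META-INF/container.xml",
            '<container xmlns="urn:oasis:names:tc:opendocument:xmlns:container" version="1.0">'
            '<rootfiles><rootfile full-path="OEBPS/content.opf" '
            'media-type="application/oebps-package+xml"/></rootfiles></container>',
        )
        archive.writestr(
            "OEBPS/content.opf",
            '<package xmlns="http://www.idpf.org/2007/opf" version="3.0">'
            '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">'
            "<dc:title>Demo Book</dc:title><dc:creator>A. Author</dc:creator></metadata>"
            '<manifest/><spine><itemref idref="ch1"/><itemref idref="ch2"/></spine></package>',
        )
    return buffer.getvalue()
```

Running `read_basic_metadata(io.BytesIO(make_sample_epub()))` exercises the whole path from archive to metadata; a real file path works the same way.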

Development

Setup

git clone https://github.com/Abdullah-Wex/epubsage.git
cd epubsage
uv venv && source .venv/bin/activate
uv pip install -e ".[dev]"

Commands

make test      # Run 60+ tests
make format    # Format code
make lint      # Check quality

Running Tests

PYTHONPATH="$PWD" .venv/bin/python -m pytest tests/ -v

Documentation

| Document | Description |
| --- | --- |
| CLI Reference | Complete CLI documentation |
| API Reference | Python API documentation |
| Examples | Real-world use cases |

License

MIT License. See LICENSE for details.

Contributing

Contributions welcome! See CONTRIBUTING.md for guidelines.

EpubSage — Extract. Analyze. Build.
