EpubSage is a powerful Python library and CLI tool for extracting structured content, metadata, and images from EPUB files. It handles the complexity of diverse publisher formats and provides a clean, unified API.
EPUB files vary significantly between publishers. Headers can be nested in `<span>` tags, chapters can be split across files, and metadata formats differ widely. EpubSage abstracts this complexity:
```python
from epub_sage import process_epub

result = process_epub("book.epub")
print(f"Title: {result.title}")
print(f"Chapters: {result.total_chapters}")
```

That's it. One function call to extract everything.
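To make the publisher variation described above concrete, here is a minimal standard-library sketch of reading a heading whether or not its text is wrapped in `<span>` tags. It is an illustration only, not EpubSage's implementation; `HeadingCollector` and `extract_headings` are hypothetical names, and the two markup samples are invented.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the visible text of h1-h6 elements, ignoring inline wrappers."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.headings = []   # finished heading strings
        self._parts = None   # text fragments of the heading currently open

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self._parts = []

    def handle_data(self, data):
        # Text inside <span>, <a>, <em>, ... still arrives here while a heading is open.
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADING_TAGS and self._parts is not None:
            # Join the fragments and collapse runs of whitespace.
            self.headings.append(" ".join("".join(self._parts).split()))
            self._parts = None


def extract_headings(html):
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    return collector.headings


# Two invented samples of the same heading as different publishers might mark it up.
publisher_a = "<h1>Chapter 1: Getting Started</h1>"
publisher_b = '<h1><span class="num">Chapter 1:</span> <span>Getting Started</span></h1>'

print(extract_headings(publisher_a))
print(extract_headings(publisher_b))
```

Both samples yield the same heading text, which is the kind of normalization a publisher-agnostic extractor has to perform.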
| Feature | Description |
|---|---|
| Publisher-Agnostic | Works with O'Reilly, Packt, Manning, and more |
| Complete Extraction | Chapters, metadata, images, word counts, reading time |
| TOC-Based Extraction | Precise section splitting using TOC anchor boundaries |
| Smart Image Handling | Discovers and validates all referenced images |
| Content Classification | Identifies front matter, chapters, back matter, parts |
| Dublin Core Metadata | Full standards-compliant metadata extraction |
| TOC Parsing | Supports NCX (EPUB 2) and NAV (EPUB 3) |
| Full-Text Search | Search across all book content |
| CLI Tool | 13 commands for complete EPUB analysis |
Requirements:

- Python 3.10+
- Dependencies: `beautifulsoup4`, `lxml`, `pydantic`, `typer`, `rich`
Install with pip:

```bash
pip install epubsage
```

Or with uv:

```bash
uv add epubsage
```

Quick start in Python:

```python
from epub_sage import process_epub

result = process_epub("book.epub")
print(f"Title: {result.title}")
print(f"Author: {result.author}")
print(f"Words: {result.total_words:,}")
print(f"Reading time: {result.estimated_reading_time}")

# Show the first three chapters
for chapter in result.chapters[:3]:
    print(f"  {chapter['chapter_id']}: {chapter['title']}")
```

Or from the command line:

```bash
epub-sage info book.epub
```

EpubSage includes 13 commands for complete EPUB analysis.
```bash
epub-sage --help
```

| Command | Description |
|---|---|
| `info` | Quick book summary |
| `stats` | Detailed statistics |
| `chapters` | List chapters with word counts |
| `metadata` | Dublin Core metadata |
| `toc` | Table of contents |
| `images` | Image distribution |
| `search` | Full-text search |
| `validate` | Validate EPUB structure |
| `spine` | Reading order |
| `manifest` | All EPUB resources |
| `extract` | Export to JSON |
| `list` | Raw EPUB contents |
| `cover` | Extract cover image |
```bash
epub-sage chapters book.epub
epub-sage search book.epub "machine learning"
epub-sage extract book.epub -o output.json
```

Basic usage, checking `result.success` before reading the result:

```python
from epub_sage import process_epub

result = process_epub("book.epub")

if result.success:
    print(f"Title: {result.title}")
    print(f"Author: {result.author}")
    print(f"Chapters: {result.total_chapters}")
else:
    print(f"Errors: {result.errors}")
```

Chapters:

```python
for chapter in result.chapters:
    print(f"{chapter['chapter_id']}: {chapter['title']}")
    print(f"  Words: {chapter['word_count']}")
    print(f"  Images: {len(chapter['images'])}")
    print(f"  Type: {chapter['content_type']}")
```

Metadata:

```python
metadata = result.full_metadata
print(f"Title: {metadata.title}")
print(f"Publisher: {metadata.publisher}")
print(f"ISBN: {metadata.get_isbn()}")
print(f"Publication Date: {metadata.get_publication_date()}")
```

Images:

```python
for chapter in result.chapters:
    if chapter['images']:
        print(f"Chapter: {chapter['title']}")
        for img in chapter['images']:
            print(f"  - {img}")
```

Content blocks:

```python
chapter = result.chapters[0]
for block in chapter['content']:
    print(f"[{block['tag']}] {block['text'][:100]}...")
```

Result fields:

| Field | Type | Description |
|---|---|---|
| `title` | `str` | Book title |
| `author` | `str` | Primary author |
| `publisher` | `str` | Publisher name |
| `chapters` | `list[dict]` | Chapter data |
| `total_chapters` | `int` | Chapter count |
| `total_words` | `int` | Word count |
| `estimated_reading_time` | `dict` | `{'hours': N, 'minutes': N}` |
| `success` | `bool` | Processing status |
| `full_metadata` | `DublinCoreMetadata` | Complete metadata |
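Because `estimated_reading_time` is documented as a dict of the form `{'hours': N, 'minutes': N}`, printing it directly shows the raw dict. The following is a small sketch of turning it into a readable string; `format_reading_time` is a hypothetical helper, not part of EpubSage, and the sample values are invented.

```python
def format_reading_time(reading_time):
    """Render a {'hours': N, 'minutes': N} dict as a short human-readable string."""
    hours = int(reading_time.get("hours", 0))
    minutes = int(reading_time.get("minutes", 0))
    parts = []
    if hours:
        parts.append(f"{hours} h")
    # Always show minutes when there are no hours, so the result is never empty.
    if minutes or not parts:
        parts.append(f"{minutes} min")
    return " ".join(parts)


# Sample values shaped like result.estimated_reading_time (made up for illustration).
print(format_reading_time({"hours": 5, "minutes": 42}))  # 5 h 42 min
print(format_reading_time({"hours": 0, "minutes": 18}))  # 18 min
```

With a real result, the call would be `format_reading_time(result.estimated_reading_time)`.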
Chapter fields:

| Field | Type | Description |
|---|---|---|
| `chapter_id` | `int` | Sequential ID |
| `title` | `str` | Chapter title |
| `word_count` | `int` | Words in chapter |
| `images` | `list[str]` | Image paths |
| `content` | `list[dict]` | Content blocks |
| `sections` | `list[dict]` | TOC-based sections with nested subsections |
| `content_type` | `str` | `chapter`, `front_matter`, `back_matter`, `part` |
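Since chapters are plain dicts, they can be filtered and scanned with ordinary Python. The sketch below tallies chapters by `content_type` and does a naive case-insensitive search over content blocks. It assumes only the `chapter_id`, `content_type`, and `content` fields and the `tag` and `text` block keys used in this README; `count_by_type` and `find_blocks` are hypothetical helpers, not EpubSage's search service, and the sample data is invented.

```python
from collections import Counter

# Hand-written sample shaped like result.chapters; every value is invented.
chapters = [
    {
        "chapter_id": 0,
        "title": "Preface",
        "content_type": "front_matter",
        "content": [{"tag": "p", "text": "Who this book is for."}],
    },
    {
        "chapter_id": 1,
        "title": "Neural Networks",
        "content_type": "chapter",
        "content": [
            {"tag": "h1", "text": "Neural Networks"},
            {"tag": "p", "text": "Machine learning models learn from data."},
        ],
    },
]


def count_by_type(chapters):
    """Tally chapters per content_type value."""
    return Counter(chapter["content_type"] for chapter in chapters)


def find_blocks(chapters, query):
    """Return (chapter_id, tag, text) for blocks whose text contains query, ignoring case."""
    needle = query.lower()
    return [
        (chapter["chapter_id"], block["tag"], block["text"])
        for chapter in chapters
        for block in chapter["content"]
        if needle in block["text"].lower()
    ]


print(count_by_type(chapters))
print(find_blocks(chapters, "machine learning"))
```

With a real book, pass `result.chapters` instead of the sample list.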
```text
epub_sage/
├── core/          # Parsers (Dublin Core, Structure, TOC)
├── extractors/    # EPUB handling, content extraction
├── processors/    # Processing pipelines
├── models/        # Pydantic data models
├── services/      # Search, export services
└── cli.py         # Command-line interface
```
Processing Pipeline:

1. `EpubExtractor` → Unzips EPUB
2. `DublinCoreParser` → Extracts metadata
3. `EpubStructureParser` → Analyzes structure
4. `ContentExtractor` → Extracts text & images
5. `SimpleEpubProcessor` → Orchestrates all steps
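To give a feel for what the first two stages involve, here is a standard-library sketch that opens an EPUB as a ZIP archive, locates the package document through `META-INF/container.xml`, and reads two Dublin Core elements. It is not EpubSage's code: `read_title_and_creator` and `build_sample_epub` are hypothetical names, and the tiny in-memory EPUB exists only so the example runs without a real book.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
DC_NS = {"dc": "http://purl.org/dc/elements/1.1/"}

CONTAINER_XML = """<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>"""

OPF_XML = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Sample Book</dc:title>
    <dc:creator>Jane Doe</dc:creator>
  </metadata>
</package>"""


def build_sample_epub():
    """Build a minimal, invented EPUB-like archive in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/epub+zip")
        archive.writestr("META-INF/container.xml", CONTAINER_XML)
        archive.writestr("OEBPS/content.opf", OPF_XML)
    buffer.seek(0)
    return buffer


def read_title_and_creator(epub_file):
    """Return (opf_path, title, creator) from an EPUB path or file-like object."""
    with zipfile.ZipFile(epub_file) as archive:
        # Step 1: container.xml points at the package (OPF) document.
        container = ET.fromstring(archive.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        opf_path = rootfile.attrib["full-path"]
        # Step 2: the OPF metadata block carries the Dublin Core elements.
        package = ET.fromstring(archive.read(opf_path))
    title = package.findtext(".//dc:title", namespaces=DC_NS)
    creator = package.findtext(".//dc:creator", namespaces=DC_NS)
    return opf_path, title, creator


print(read_title_and_creator(build_sample_epub()))
```

Real EPUBs add many complications (multiple creators, refinements, missing elements), which is the part the library's parsers are meant to absorb.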
Development setup:

```bash
git clone https://github.com/Abdullah-Wex/epubsage.git
cd epubsage
uv venv && source .venv/bin/activate
uv pip install -e ".[dev]"
```

```bash
make test    # Run 60+ tests
make format  # Format code
make lint    # Check quality
```

To run pytest directly:

```bash
PYTHONPATH="$PWD" .venv/bin/python -m pytest tests/ -v
```

| Document | Description |
|---|---|
| CLI Reference | Complete CLI documentation |
| API Reference | Python API documentation |
| Examples | Real-world use cases |
MIT License. See LICENSE for details.
Contributions welcome! See CONTRIBUTING.md for guidelines.
EpubSage — Extract. Analyze. Build.









