Convert PDFs and EPUBs into audiobooks with synchronized text highlighting using state-of-the-art text-to-speech models, or transcribe audio files to text with timestamps using speech-to-text models.
This project provides Jupyter notebooks that:
Text-to-Speech (TTS):
- Extract text from PDFs/EPUBs with precise coordinate tracking
- Generate high-quality speech audio using AI TTS models
- Create timeline manifests for synchronized text highlighting
- Output files ready to upload to the web player at https://svm0n.github.io/ttsweb/
Speech-to-Text (STT):
- Transcribe audio files (MP3, WAV, M4A, FLAC, OGG) and video files (MP4, MOV, AVI, MKV, etc.) to text
- Generate timestamped transcripts with multiple output formats
- Support for multiple languages and translation
- Output as TXT, SRT subtitles, VTT captions, or JSON with word-level timestamps
- Automatic audio extraction from video files using ffmpeg
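Under the hood, the video-to-audio step is a single ffmpeg call. A minimal sketch of the equivalent command from Python, with placeholder file names (the notebook's own helper may use different parameters):

```python
# Extract a mono 16kHz WAV track from a video file with ffmpeg.
# "input.mp4" and "extracted.wav" are placeholder paths.
import subprocess

subprocess.run(
    [
        "ffmpeg",
        "-i", "input.mp4",  # source video
        "-vn",              # drop the video stream
        "-ac", "1",         # mono
        "-ar", "16000",     # 16kHz sample rate, what Whisper expects
        "extracted.wav",
    ],
    check=True,
)
```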
The easiest way to convert text/PDFs to speech is with the unified TTS notebook:
1. Open TTS.ipynb in Google Colab
2. The notebook will automatically:
   - Detect it's running in Colab
   - Download required Python modules from GitHub
   - Set up the environment
3. Upload your PDF/EPUB when prompted and run the cells
4. Download your generated audio and manifest files
To run locally instead:

1. Clone this repository:

   ```bash
   git clone https://github.com/SVM0N/ttsweb.git
   cd ttsweb
   ```

2. Open the unified notebook:

   ```bash
   jupyter notebook TTS.ipynb
   ```

3. Follow the notebook instructions to:
   - Create an isolated conda environment (optional but recommended)
   - Choose your TTS model (Kokoro v0.9, Kokoro v1.0, Maya1, Silero v5)
   - Choose your PDF extractor (Unstructured, PyMuPDF, Vision, Nougat)
   - Run synthesis on text, PDFs, or EPUBs
The unified notebook (TTS.ipynb) combines all models and extractors in one place!
Benefits of the unified notebook:
- ✨ Smart dependency installation: Only installs packages you actually need
- 🎯 Easy configuration: Choose models/extractors in one cell at the top
- 💾 Saves storage: No need to install everything upfront
- 🔄 Easy switching: Change configuration and re-run without reinstalling
- 🌍 Works everywhere: Runs locally or in Google Colab with automatic detection
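To illustrate the "smart dependency installation" idea: the setup cell only installs what the selected configuration needs. A simplified sketch, with example package names (the real setup code is more involved):

```python
# Install a package only if it is not already importable.
import importlib.util
import subprocess
import sys

def ensure_installed(module_name: str, pip_name: str | None = None) -> None:
    if importlib.util.find_spec(module_name) is None:
        subprocess.run(
            [sys.executable, "-m", "pip", "install", pip_name or module_name],
            check=True,
        )

# For example, only pull in PyMuPDF when that extractor is selected:
ensure_installed("fitz", pip_name="pymupdf")
```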
Convert audio files to text with timestamped transcripts:
1. Open STT.ipynb in Google Colab
2. The notebook will automatically:
   - Detect it's running in Colab
   - Download required Python modules from GitHub
   - Set up the environment
3. Upload your audio file when prompted and run the cells
4. Download your transcripts in multiple formats (TXT, SRT, VTT, JSON)
To run locally instead:

1. Clone this repository (if you haven't already):

   ```bash
   git clone https://github.com/SVM0N/ttsweb.git
   cd ttsweb
   ```

2. Open the STT notebook:

   ```bash
   jupyter notebook STT.ipynb
   ```

3. Follow the notebook instructions to:
   - Create an isolated conda environment (optional but recommended)
   - Choose your STT model (Whisper, Faster-Whisper, or WhisperX)
   - Select output formats (TXT, SRT, VTT, JSON)
   - Run transcription on your audio files
Supported STT Models:
- Whisper (tiny, base, small, medium, large) - OpenAI's speech recognition
- Faster-Whisper (optimized versions) - 4x faster with same accuracy
- WhisperX (tiny, base, small, medium, large-v2) - Word-level timestamps + speaker diarization
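For orientation, this is roughly what Faster-Whisper usage looks like with the standard faster-whisper API (the notebook's FasterWhisperBackend wraps something similar; "audio.mp3" is a placeholder):

```python
# Transcribe with faster-whisper; int8 compute keeps it fast on CPU.
from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu", compute_type="int8")
segments, info = model.transcribe("audio.mp3")

print(f"Detected language: {info.language}")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```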
Supported Input Formats:
- Audio: MP3, WAV, M4A, FLAC, OGG
- Video: MP4, MOV, AVI, MKV, WebM, FLV, etc. (audio extracted automatically)
Supported Output Formats:
- TXT: Plain text transcript
- SRT: Subtitle format with timestamps
- VTT: WebVTT captions for web players
- JSON: Full transcription data with word-level timestamps
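To make the SRT format concrete, converting Whisper-style segments (dicts with start, end, and text) into SRT is mostly timestamp formatting. A minimal sketch of what a formatter like the one in output_formatters.py might do (the actual module may differ):

```python
# Format a list of {"start": float, "end": float, "text": str} segments
# as SRT: 1-indexed entries with HH:MM:SS,mmm timestamps.
def to_srt_time(seconds: float) -> str:
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments: list[dict]) -> str:
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n"
            f"{to_srt_time(seg['start'])} --> {to_srt_time(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)
```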
Features:
- ✨ Multiple model sizes (choose speed vs. accuracy)
- 🌍 Multi-language support with auto-detection
- 🌐 Translation to English
- ⚡ Faster-Whisper for 4x speedup
- 📄 Multiple output formats simultaneously
- 🎬 Video file support (automatic audio extraction)
- 👥 Speaker Diarization: WhisperX separates and labels different speakers
- 🤖 Smart defaults: Automatically selects the best model based on environment
  - Colab: `whisperx-medium` (speaker diarization, best quality, leverages GPU)
  - Local/M4 Mac: `faster-whisper-base` (fast on CPU, good quality)
You can override the default model in the configuration cell if needed!
WhisperX provides advanced features:
- Speaker Separation: Automatically detects and labels different speakers
- Word-Level Timestamps: More precise timing than standard Whisper
- Perfect for: Meetings, interviews, podcasts, multi-speaker conversations
Requirements:
- HuggingFace account token (free): https://huggingface.co/settings/tokens
- Accept pyannote terms: https://huggingface.co/pyannote/speaker-diarization
- Set the `HF_TOKEN` environment variable or pass the token in the notebook
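A hedged sketch of what WhisperX diarization involves, following the whisperX README at the time of writing (verify against the version you install; "meeting.mp3" and the model size are placeholders):

```python
# Transcribe and assign speaker labels with WhisperX. Requires a
# HuggingFace token with the pyannote terms accepted.
import os
import whisperx

hf_token = os.environ["HF_TOKEN"]
device = "cuda"

audio = whisperx.load_audio("meeting.mp3")
model = whisperx.load_model("medium", device)
result = model.transcribe(audio)

diarize_model = whisperx.DiarizationPipeline(use_auth_token=hf_token, device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)

for seg in result["segments"]:
    print(seg.get("speaker", "UNKNOWN"), seg["text"])
```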
Output Example:
```
[SPEAKER_00] 0:00-0:05
Hello, welcome to the meeting.
[SPEAKER_01] 0:05-0:12
Thanks for having me. Let's discuss the project.
```
Legacy notebooks have been moved to the archived/ folder. You can still use them if you prefer the old standalone approach, but the unified notebook is recommended for new users.
Prerequisites:
- Python 3.10+
- conda (recommended) or pip
- ffmpeg (for MP3 conversion)
  - macOS: `brew install ffmpeg`
  - Linux: `sudo apt-get install ffmpeg`
  - Windows: Download from https://ffmpeg.org/
Steps:
1. Clone this repository:

   ```bash
   git clone https://github.com/SVM0N/ttsweb.git
   cd ttsweb
   ```

2. Choose a notebook (see "Which Model to Use" below)

3. Open the notebook in Jupyter:

   ```bash
   jupyter notebook TTS_Kokoro_Local.ipynb
   ```

4. Follow the notebook instructions to create an isolated conda environment (recommended)

5. Run all cells and provide your PDF/EPUB file path when prompted
This project now features a modular Python architecture that makes it easy to:
- Switch between different TTS models without code duplication
- Choose PDF extraction strategies based on your needs
- Extend functionality with custom backends
TTS.ipynb - Unified TTS notebook interface
- Single notebook for all TTS models and PDF extractors
- Interactive model and extractor selection
- No code duplication across notebooks
tts_backends.py - TTS model backends
- `KokoroBackend`: Kokoro TTS (v0.9 and v1.0)
- `SileroBackend`: Silero v5 Russian TTS
- Extensible for adding new models
pdf_extractors.py - PDF extraction strategies
- `UnstructuredExtractor`: Advanced layout analysis (default)
- `PyMuPDFExtractor`: Fast extraction for clean PDFs
- `VisionExtractor`: OCR for scanned PDFs (macOS only)
- `NougatExtractor`: Academic papers with equations
tts_utils.py - Common utilities
- EPUB extraction
- Sentence splitting
- WAV to MP3 conversion
- File naming utilities
manifest.py - Manifest generation
- Timeline creation with sentence-level timing
- Coordinate tracking for synchronized highlighting
- Manifest validation and statistics
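To make the manifest shape concrete, a sketch of assembling one from a synthesis timeline; the timeline entry fields mirror the example manifest structure shown later in this document, and the actual manifest.py may differ:

```python
# Build a highlight manifest from timeline entries carrying start/end
# times (seconds), sentence text, and page/bounding-box location.
import json

# One illustrative entry; real entries come from the TTS backend.
timeline = [
    {
        "start": 0.0,
        "end": 2.5,
        "text": "This is the first sentence.",
        "location": {"page_number": 1, "points": [[0, 0], [1, 0], [1, 1], [0, 1]]},
    }
]

def build_manifest(audio_url: str, timeline: list[dict]) -> dict:
    return {
        "audioUrl": audio_url,
        "sentences": [{"i": i, **entry} for i, entry in enumerate(timeline)],
    }

with open("document_tts_manifest.json", "w") as f:
    json.dump(build_manifest("document_tts.mp3", timeline), f, indent=2)
```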
STT.ipynb - Unified STT notebook interface
- Single notebook for all STT models
- Multiple output format support
- Easy model switching
stt_backends.py - STT model backends
- `WhisperBackend`: OpenAI Whisper models
- `FasterWhisperBackend`: Optimized Whisper (4x faster)
- Extensible for adding new models
output_formatters.py - Transcription output formatters
- TXT format (plain text)
- SRT format (subtitles)
- VTT format (web captions)
- JSON format (timestamped data)
stt_setup.py - STT dependency installation
- Smart installation based on model selection
- Minimal dependencies
stt_examples.py - High-level STT workflows
- Complete transcription pipeline
- Multi-format output generation
config.py - Configuration management
- Device selection (CUDA/CPU/MPS)
- Output directory management
- Logging configuration
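For illustration, the "auto" device mode typically reduces to a few torch checks; a sketch of the idea (config.py's actual logic may differ):

```python
# Prefer CUDA, then Apple's MPS, then fall back to CPU.
import torch

def pick_device() -> str:
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

print(pick_device())
```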
cleanup.py - Environment and cache management
- Conda environment setup and cleanup
- Model cache management
- Storage optimization
- No Code Duplication: Common functionality shared across all notebooks
- Easy to Extend: Add new TTS models or PDF extractors as plugins
- Mix and Match: Combine any TTS model with any PDF extractor
- Better Testing: Each module can be tested independently
- Cleaner Codebase: Easier to maintain and debug
You can also use the modules directly in your own Python scripts:
```python
from config import TTSConfig
from tts_backends import create_backend
from pdf_extractors import get_available_extractors

# Configure
config = TTSConfig(output_dir=".", device="auto")

# Create TTS backend
tts = create_backend("kokoro_1.0", device=config.device)

# Get PDF extractor
extractors = get_available_extractors()
pdf_extractor = extractors["pymupdf"]

# Extract and synthesize
pdf_bytes = open("document.pdf", "rb").read()
elements = pdf_extractor.extract(pdf_bytes)
wav_bytes, timeline = tts.synthesize_text_to_wav(elements, voice="af_heart")
```

When to use:
- You want a single notebook that supports all models
- You want to easily switch between TTS models
- You want to try different PDF extraction strategies
- You prefer a clean, modular interface
Supported TTS Models:
- Kokoro v0.9.4+ (10 voices, English-focused, stable)
- Kokoro v1.0 (54 voices, 8 languages, latest)
- Maya1 (20+ emotions, natural language voices, expressive, GPU required)
- Silero v5 (Russian, 6 speakers)
Supported PDF Extractors:
- Unstructured (advanced layout analysis)
- PyMuPDF (fast, lightweight)
- Apple Vision (OCR, macOS only)
- Nougat (academic papers)
Pros:
- All-in-one solution
- No code duplication
- Easy to switch between models
- Modular and extensible
- Maya1 support for expressive, emotional speech
Cons:
- Requires all module files (tts_backends.py, pdf_extractors.py, etc.)
- Maya1 requires GPU with 16GB+ VRAM
What makes Maya1 special:
- 20+ emotions: laugh, cry, whisper, angry, sigh, gasp, and more
- Natural language voice descriptions: Describe voices in plain English like "40-year-old, warm, conversational"
- Inline emotion tags: Add emotions directly in text like `<laugh>` or `<whisper>`
- High quality: 3B parameters, trained on diverse data
Requirements:
- GPU with 16GB+ VRAM (A100, H100, or RTX 4090 recommended)
- CUDA support (will not work on CPU/MPS)
- First run downloads ~3GB model
- Best for Google Colab with GPU runtime
Example voice descriptions:
- "Realistic male voice in the 30s with American accent"
- "Warm female voice in the 40s, conversational"
- "Young energetic male voice, British accent"
Example with emotions:
"Hello! <laugh> This is amazing. <whisper> Can you hear me?"
Each notebook generates two files:
Audio file:
- Format: MP3 or WAV (configurable)
- Sample Rate: 24kHz (both Kokoro and F5-MLX)
- Naming: `{filename}_tts.mp3` or `{filename}_tts.wav`
Manifest file:
- Format: JSON
- Naming: `{filename}_tts_manifest.json`
- Contains:
  - Sentence-level timestamps
  - Text content for each segment
  - Coordinate data for highlighting (page number, bounding boxes)
Example manifest structure:
```json
{
  "audioUrl": "document_tts.mp3",
  "sentences": [
    {
      "i": 0,
      "start": 0.0,
      "end": 2.5,
      "text": "This is the first sentence.",
      "location": {
        "page_number": 1,
        "points": [[x0,y0], [x1,y1], [x2,y2], [x3,y3]]
      }
    }
  ]
}
```
1. Generate your files using any notebook above
2. Open the web player at: https://svm0n.github.io/ttsweb/
3. Upload your files:
   - Your PDF file
   - The audio file (MP3/WAV)
   - The manifest JSON file
4. Play and enjoy synchronized audio with text highlighting!
The web player features:
- PDF rendering with synchronized highlighting
- Audio playback controls (play/pause, seek, speed control)
- Click on text to jump to that audio position
- Dark mode support
- Mobile-friendly responsive design
Kokoro v0.9.x (TTS_Kokoro_Local.ipynb):
- Available voices: `af_heart`, `af_bella`, `af_sarah`, `am_adam`, `am_michael`, and more
Kokoro v1.0 (TTS_Kokoro_v.1.0_Local.ipynb):
- 54 voices across 8 languages (see full list in section 2 above)
- US, British, French, Japanese, Korean, Chinese voices available
To change the voice, edit the configuration cell:

```python
VOICE = "af_heart"  # Change to any available voice
```

F5-MLX voice cloning: provide reference audio for zero-shot voice cloning:

```python
REF_AUDIO = "reference.wav"  # 5-10s mono WAV at 24kHz
REF_TEXT = "This is what the speaker says in the reference audio."
```

Convert your audio to the expected format:

```bash
ffmpeg -i input.wav -ac 1 -ar 24000 -sample_fmt s16 -t 10 reference.wav
```

Output format and playback speed are set the same way:

```python
FORMAT = "mp3"  # or "wav"
SPEED = 1.0     # 0.5 = half speed, 2.0 = double speed
```

TTS models are cached locally to improve performance. Each notebook includes a cache management section at the end where you can:
1. View cache locations and sizes for:
   - HuggingFace models (`~/.cache/huggingface/`)
   - PyTorch models (`~/.cache/torch/`)
   - Pip packages
   - Model-specific caches

2. Delete cached models to free up storage:
   - Individual model deletion
   - Bulk cache cleanup
   - Environment-specific cleanup
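If you prefer to inspect caches outside the notebooks, a small sketch that totals the directories listed above:

```python
# Print the size of each model cache directory.
from pathlib import Path

def dir_size_gb(path: Path) -> float:
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file()) / 1e9

for cache in [Path.home() / ".cache/huggingface", Path.home() / ".cache/torch"]:
    if cache.exists():
        print(f"{cache}: {dir_size_gb(cache):.2f} GB")
```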
Typical cache sizes:
- Kokoro models: ~500MB - 1GB
- F5-TTS-MLX models: ~300MB - 500MB
- Detectron2 models: ~200MB - 400MB
- Nougat models: ~1GB - 2GB
Each local notebook includes an optional cleanup section at the end to help manage these caches.
I want the easiest, most flexible option: → Use TTS.ipynb (unified notebook, recommended for everyone)
I need expressive, emotional speech: → Use TTS.ipynb with the Maya1 backend (requires a GPU in Google Colab)
I need Russian language TTS: → Use TTS.ipynb with the Silero v5 backend
I have Apple Silicon (M1/M2/M3/M4): → Use archived/TTS_F5_MLX.ipynb or TTS.ipynb with Kokoro
I need maximum speed and my PDF has a text layer: → Use TTS.ipynb with the PyMuPDF extractor
I have a scanned PDF (no text layer): → Use TTS.ipynb with the Vision or Nougat extractor
I have an academic paper with equations: → Use TTS.ipynb with the Nougat extractor
I have access to a GPU with 16GB+ VRAM: → Try Maya1 for the most expressive and natural-sounding speech
I prefer the old standalone notebooks: → Check the archived/ folder for legacy notebooks
MPS errors on Apple Silicon:
- Add this before imports:

  ```python
  import os
  os.environ['PYTORCH_ENABLE_MPS_FALLBACK'] = '1'
  ```

- Restart your Jupyter kernel
- This has been fixed in the latest version
- Make sure you're using the updated notebooks
Out of memory:
- Use TTS_Kokoro_PyMuPDF.ipynb (lowest memory usage)
- Or process shorter documents
- Or add more RAM/swap space
PDF extraction fails or returns no text:
- If using PyMuPDF: your PDF might be scanned → use Vision or Nougat
- If using Vision: check macOS version compatibility
- If using Nougat: ensure a GPU is available and CUDA is installed
This project is licensed for non-commercial use only. For commercial licensing, please contact SVM0N on GitHub.
This project uses:
- Kokoro TTS - High-quality text-to-speech
- F5-TTS-MLX - Apple Silicon optimized TTS
- Unstructured.io - Document parsing
- Detectron2 - Layout detection
- PyMuPDF - Fast PDF processing
- Nougat - Academic document OCR
- PDF.js - Web PDF rendering
The following notebooks have been moved to the archived/ folder and are still available for backwards compatibility:
TTS_Kokoro_Local.ipynb (Kokoro v0.9 + Unstructured):
When to use:
- General purpose, works on most machines
- Best balance of quality, speed, and coordinate accuracy
- PDF extraction using ML-based layout analysis
- Stable version with Kokoro v0.9.4+
Machine requirements:
- RAM: 8GB minimum, 16GB recommended
- GPU: Optional (CUDA) but works fine on CPU
- Storage: ~5GB for dependencies
Pros:
- Excellent text extraction for complex layouts
- Accurate bounding box coordinates
- Multiple voice options (10 voices)
- Reliable and well-tested
Cons:
- Slower PDF processing than PyMuPDF
- Larger dependency footprint
TTS_Kokoro_v.1.0_Local.ipynb (Kokoro v1.0):
When to use:
- Want the latest Kokoro v1.0 features
- Need access to 54 voices across 8 languages
- Want voice blending capabilities
- Multi-language support (French, Japanese, Korean, Chinese, etc.)
Machine requirements:
- RAM: 8GB minimum, 16GB recommended
- GPU: Optional (CUDA) but works fine on CPU
- Storage: ~5GB for dependencies
Pros:
- 54 voices (vs 10 in v0.9.x)
- 8 languages (vs 1 in v0.9.x)
- Voice blending for custom voices
- Same API as v0.9.x (backward compatible)
- Trained on hundreds of hours of audio
Cons:
- Newer, less battle-tested than v0.9.x
- Larger model downloads
Available voices:
- US Female (11): af_alloy, af_aoede, af_bella, af_heart, af_jessica, af_kore, af_nicole, af_nova, af_river, af_sarah, af_sky
- US Male (8): am_adam, am_echo, am_eric, am_fenrir, am_liam, am_michael, am_onyx, am_puck
- British Female (4): bf_alice, bf_emma, bf_isabella, bf_lily
- British Male (4): bm_daniel, bm_fable, bm_george, bm_lewis
- Plus voices for French, Japanese, Korean, Chinese, and more
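For reference, direct use of Kokoro v1.0 via the kokoro package looks roughly like this (API per the package README at the time of writing; the notebook's KokoroBackend wraps this for you):

```python
# Synthesize with the kokoro package; lang_code "a" selects American
# English, and the pipeline yields audio chunks at 24kHz.
from kokoro import KPipeline
import soundfile as sf

pipeline = KPipeline(lang_code="a")
text = "Kokoro v1.0 ships fifty-four voices across eight languages."
for i, (graphemes, phonemes, audio) in enumerate(pipeline(text, voice="af_heart")):
    sf.write(f"chunk_{i}.wav", audio, 24000)
```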
TTS_F5_MLX.ipynb (F5-TTS-MLX for Apple Silicon):
When to use:
- You have Apple Silicon (M1/M2/M3/M4)
- Want maximum performance on Mac
- Need voice cloning capabilities
Machine requirements:
- Apple Silicon Mac (M1/M2/M3/M4)
- RAM: 8GB minimum, 16GB recommended
- Storage: ~5GB for dependencies
Pros:
- Optimized for Apple's MLX framework
- Excellent multicore utilization
- Zero-shot voice cloning support
- ~4 seconds per sentence on M3/M4
Cons:
- Apple Silicon only (CPU fallback available but slow)
- Requires reference audio for voice cloning
Voice cloning: Provide a 5-10 second mono WAV file (24kHz) and its transcription to clone any voice.
TTS_Kokoro_PyMuPDF.ipynb (Kokoro + PyMuPDF):
When to use:
- PDF has a clean text layer (not scanned)
- Need fastest possible processing
- Have limited RAM/storage
Machine requirements:
- RAM: 4GB minimum
- CPU: Any modern CPU
- Storage: ~2GB for dependencies
Pros:
- Extremely fast PDF text extraction
- Minimal dependencies
- Low resource usage
- Very accurate coordinates
Cons:
- Only works for PDFs with text layers
- Fails on scanned PDFs or images
- No layout analysis
Kokoro + Apple Vision (OCR) notebook:
When to use:
- PDF is scanned (no text layer)
- Need OCR capabilities
- macOS with Vision Framework
Machine requirements:
- macOS 10.15+
- RAM: 8GB minimum
- Storage: ~3GB for dependencies
Pros:
- Works on scanned/image-based PDFs
- Uses Apple's Vision Framework OCR
- Good for documents without text layers
Cons:
- macOS only
- Slower than direct text extraction
- OCR may have accuracy issues
Silero v5 (Russian TTS) notebook:
When to use:
- Need Russian language text-to-speech
- Want high-quality Russian voice synthesis
- Processing Russian documents
Machine requirements:
- RAM: 8GB minimum
- GPU: Optional (CUDA) but works fine on CPU
- Storage: ~3GB for dependencies
Pros:
- Excellent Russian pronunciation
- 6 different speakers (xenia, eugene, baya, kseniya, aleksandr, irina)
- SSML support with automated stress and homographs
- Fast synthesis
Cons:
- Russian language only
- Limited to 6 speakers
Kokoro + Nougat (academic papers) notebook:
When to use:
- Processing academic papers with equations
- Need LaTeX/math support
- Document has complex formatting
Machine requirements:
- RAM: 16GB recommended
- GPU: Highly recommended (CUDA)
- Storage: ~8GB for dependencies
Pros:
- Excellent for academic documents
- Handles equations and math notation
- LaTeX support
Cons:
- Very slow without GPU
- Large model downloads
- Overkill for simple text
Made with ❤️ for accessible reading