Video transcription and translation CLI for language learners.
Transcribe with Whisper (local or cloud API), translate with LLMs, play with dual subtitles — all in one pipeline.
- Whisper transcription with word-level timestamps — local (stable-ts, MLX/CUDA/CPU) or cloud API (Groq, OpenAI via LiteLLM)
- Smart subtitle segmentation — spaCy POS tagging fixes dangling articles, prepositions, and Romance clitics (l', d', qu') across 26 languages
- Subtitle download — optionally grabs existing subtitles from YouTube/etc. via yt-dlp (human-made preferred, --subs to enable), skipping Whisper when they are available
- LLM translation — any language pair via Ollama (local) or cloud LLMs (Groq, OpenAI, Claude, etc.)
- Vocabulary analysis — CEFR difficulty estimation (A1–C2), rare word extraction with context and translations
- Dual playback — original + translation subtitles in mpv, or browser-based web player (pgw serve)
- Batch processing — multiple files, glob patterns, URL lists, with continue-on-error handling
- Export — VTT, SRT, ASS, plain text, bilingual VTT, side-by-side PDF/EPUB
- Shared cache — deduplicates downloads, audio extraction, and transcriptions across workspaces
# macOS
brew install uv ffmpeg mpv
brew install pango # required for PDF export (WeasyPrint)
brew install --cask ollama # optional
# Ubuntu/Debian
sudo apt install ffmpeg mpv libpango-1.0-0 libpangoft2-1.0-0
curl -fsSL https://astral.sh/uv/install.sh | sh
curl -fsSL https://ollama.com/install.sh | sh # optional
macOS PDF export: If PDF export fails with cannot load library 'libgobject-2.0-0', add this to your ~/.zshrc:
export DYLD_FALLBACK_LIBRARY_PATH=/opt/homebrew/lib
This lets the uv-managed Python find Homebrew's native libraries.
git clone https://github.com/RizhongLin/PolyglotWhisperer.git
cd PolyglotWhisperer
uv sync --all-extras
# Pull a local LLM for translation (optional)
ollama pull qwen3:8b
spaCy language models are downloaded automatically on first use.
Install only what you need
uv sync --extra transcribe # Local Whisper (stable-ts, MLX)
uv sync --extra download # URL downloading (yt-dlp)
uv sync --extra llm # LLM translation (LiteLLM, Ollama)
uv sync --extra nlp # spaCy NLP (POS tagging, lemmatizer)
uv sync --extra vocab # Vocabulary analysis (wordfreq + spaCy)
uv sync --extra export # PDF/EPUB export (WeasyPrint, ebooklib)
cp .env.example .env # edit and add your keys
LiteLLM routes to any provider via model prefix — set the matching API key in .env. See .env.example for supported providers.
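As a sketch, a minimal .env might contain entries like these (the key names GROQ_API_KEY, OPENAI_API_KEY, and ANTHROPIC_API_KEY follow LiteLLM's standard environment variable conventions; the values here are placeholders):

```shell
# Placeholder values; key names follow LiteLLM's env var conventions
GROQ_API_KEY=gsk_your_key_here
OPENAI_API_KEY=sk-your-key-here
ANTHROPIC_API_KEY=sk-ant-your-key-here
```

Only the key matching the provider prefix of your configured model is needed.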
# Full pipeline: download → transcribe → translate → play
pgw run "https://example.com/video" --translate en --no-play
# Refine transcription with LLM before translating
pgw run "https://example.com/video" --refine --translate en --no-play
# Cloud API transcription + translation (no local GPU needed)
pgw run "https://example.com/video" --backend api --llm-backend api --translate en --no-play
# Reuse existing subtitles from video page (skip Whisper if available)
pgw run "https://example.com/video" --subs --translate en --no-play
# Batch processing
pgw run *.mp4 --translate en --no-play
pgw run urls.txt --backend api --translate en --no-play
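The urls.txt batch file is assumed here to be a plain list with one URL per line (the exact format is not spelled out above):

```text
https://example.com/video-1
https://example.com/video-2
```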
# Transcribe only
pgw transcribe video.mp4 -l fr
pgw transcribe *.mp4 --backend api -l fr
# Translate existing subtitles
pgw translate subtitles.fr.vtt --to en
# Vocabulary analysis
pgw vocab pgw_workspace/my-video/20260217_164802/
# Playback
pgw play pgw_workspace/my-video/20260217_164802/
pgw serve pgw_workspace/my-video/20260217_164802/ # web player
Config layers (lowest to highest priority): packaged defaults → ~/.config/pgw/config.toml → ./pgw.toml → .env + env vars → CLI flags.
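For instance, since CLI flags sit at the top of the priority chain, a flag overrides any file-based setting for a single run:

```shell
# pgw.toml sets [whisper] backend = "local"; the flag takes priority for this run
pgw transcribe video.mp4 -l fr --backend api
```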
# pgw.toml
[whisper]
backend = "api" # "local" or "api"
api_model = "groq/whisper-large-v3-turbo" # provider/model via LiteLLM
language = "fr"
[llm]
backend = "api" # "local" or "api"
local_model = "ollama_chat/qwen3:8b" # Ollama for local backend
api_model = "openrouter/openai/gpt-oss-120b" # any LiteLLM provider/model
target_language = "en"
Environment variables use the PGW_ prefix: PGW_WHISPER__BACKEND=api, PGW_LLM__BACKEND=api, PGW_LLM__API_MODEL=<provider/model>.
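The double underscore appears to map onto TOML nesting (PGW_LLM__API_MODEL sets api_model under [llm]), so the same settings can be applied inline for a one-off run:

```shell
# One-off override: environment variables beat pgw.toml but lose to CLI flags
PGW_WHISPER__BACKEND=api PGW_LLM__BACKEND=api pgw run video.mp4 --translate en --no-play
```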
pgw_workspace/
├── .cache/ # Shared cache (cross-workspace)
│ ├── audio/ # Extracted audio
│ ├── compressed/ # API-compressed MP3s
│ ├── transcriptions/ # Whisper results (local + API)
│ └── downloads/ # yt-dlp downloads + subtitles
└── my-video/
└── 20260217_164802/
├── video.mp4 # Symlinked from source
├── audio.wav # Symlinked from cache
├── transcription.fr.vtt # Original subtitles (from Whisper or downloaded)
├── transcription.fr.txt # Plain text
├── translation.en.vtt # Translated subtitles
├── translation.en.txt # Translation plain text
├── bilingual.fr-en.vtt # Dual-language VTT
├── parallel.fr-en.pdf # Side-by-side PDF
├── parallel.fr-en.epub # Side-by-side EPUB
├── vocabulary.fr.json # CEFR analysis + rare words
└── metadata.json
| Backend | Technology | Pros | Limits |
|---|---|---|---|
| Local (default) | stable-ts | Best quality, word-level timestamps, custom regrouping | Requires GPU / model downloads |
| Cloud API | LiteLLM | Fast, cheap, no GPU, auto-compresses large files | API key required |
# Local
pgw transcribe audio.wav -l fr # large-v3-turbo on MLX
pgw transcribe audio.wav -l fr --whisper-model medium # smaller model
# Cloud API (any LiteLLM-supported provider)
pgw transcribe audio.wav --backend api -l fr
pgw transcribe audio.wav --backend api --whisper-model openai/whisper-1 -l fr
Each processed video gets a vocabulary profile: CEFR level estimation via wordfreq, top 30 rare words with context and translation, spaCy lemmatization to group inflected forms.
pgw vocab pgw_workspace/my-video/20260217_164802/ --top 50
Video/Audio/URL
→ Download (yt-dlp, cached) + fetch existing subtitles
→ Extract Audio (ffmpeg, cached)
→ Use downloaded subtitles OR Transcribe (Whisper + spaCy segmentation)
→ Refine transcription (LLM, optional — fixes ASR errors, punctuation)
→ Translate (LLM, optional — sentence-boundary chunking with overlap)
→ Export (VTT/TXT/bilingual VTT/PDF/EPUB) + Vocabulary Analysis
→ Play (mpv or web player)
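The CEFR estimation step in the vocabulary analysis can be sketched as a frequency-to-band mapping. This is a minimal illustration with invented thresholds, not pgw's actual cutoffs; in the real pipeline the Zipf scores come from the wordfreq package (zipf_frequency(word, lang)).

```python
# Illustrative sketch of bucketing word frequency into CEFR bands.
# Thresholds are invented for illustration and are NOT pgw's actual cutoffs;
# in pgw, Zipf scores come from the wordfreq package.

def zipf_to_cefr(zipf: float) -> str:
    """Map a Zipf frequency (roughly 0-8, higher = more common) to a CEFR band."""
    bands = [(5.5, "A1"), (5.0, "A2"), (4.5, "B1"), (4.0, "B2"), (3.5, "C1")]
    for threshold, level in bands:
        if zipf >= threshold:
            return level
    return "C2"  # very rare words land in the hardest band

def is_rare(zipf: float, cutoff: float = 3.5) -> bool:
    """Words below the cutoff would be surfaced as 'rare words' with context."""
    return zipf < cutoff
```

A document-level CEFR estimate could then aggregate per-word bands, for example by taking a high percentile of word difficulty rather than the mean.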
| Component | Technology |
|---|---|
| Transcription | stable-ts (MLX/CUDA/CPU) |
| Cloud APIs | LiteLLM (Groq, OpenAI, Ollama, Claude) |
| NLP | spaCy (26 language codes) + wordfreq |
| Export | WeasyPrint (PDF) + ebooklib (EPUB) |
| Subtitles | pysubs2 |
| Download | yt-dlp |
| Playback | mpv |
| CLI | Typer + Rich |
Whisper supports 100 languages — run pgw languages for the full list. spaCy POS tagging and clitic handling cover 26 language codes (including Norwegian no/nn aliases).
Common language codes
| Code | Language | Code | Language | Code | Language |
|---|---|---|---|---|---|
| fr | French | zh | Chinese | pl | Polish |
| en | English | ja | Japanese | sv | Swedish |
| de | German | ko | Korean | da | Danish |
| es | Spanish | ar | Arabic | fi | Finnish |
| it | Italian | ru | Russian | uk | Ukrainian |
| pt | Portuguese | hi | Hindi | vi | Vietnamese |
| nl | Dutch | tr | Turkish | | |
- Whisper transcription (local + cloud API) with word-level timestamps
- LLM translation + dual subtitle playback
- spaCy subtitle segmentation + Romance clitic handling (26 language codes)
- Audio cache, batch processing, vocabulary analysis, parallel text export
- Streaming pipeline event system
- Subtitle download from video pages, web player, content-addressable cache
- Hosted demo (Gradio on Hugging Face Spaces)
- Speaker diarization
- Anki card generation from subtitle pairs
