PolyglotWhisperer

Video transcription and translation CLI for language learners.
Transcribe with Whisper (local or cloud API), translate with LLMs, play with dual subtitles — all in one pipeline.

Features

  • Whisper transcription with word-level timestamps — local (stable-ts, MLX/CUDA/CPU) or cloud API (Groq, OpenAI via LiteLLM)
  • Smart subtitle segmentation — spaCy POS tagging fixes dangling articles, prepositions, and Romance clitics (l', d', qu') across 26 languages
  • Subtitle download — optionally grabs existing subtitles from YouTube/etc. via yt-dlp (human-made preferred, --subs to enable), skips Whisper when available
  • LLM translation — any language pair via Ollama (local) or cloud LLMs (Groq, OpenAI, Claude, etc.)
  • Vocabulary analysis — CEFR difficulty estimation (A1–C2), rare word extraction with context and translations
  • Dual playback — original + translation subtitles in mpv, or browser-based web player (pgw serve)
  • Batch processing — multiple files, glob patterns, URL lists, with error-continue
  • Export — VTT, SRT, ASS, plain text, bilingual VTT, side-by-side PDF/EPUB
  • Shared cache — deduplicates downloads, audio extraction, and transcriptions across workspaces
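The "smart segmentation" idea above can be sketched in a few lines. This is only an illustration of the dangling-token fix, not the tool's actual implementation: the real pipeline decides via spaCy POS tags, while the hard-coded French token set below is a simplification for the example.

```python
# Illustrative sketch: if a subtitle line ends with an article, preposition,
# or clitic, push that token onto the next line so no line ends mid-phrase.
# The real pipeline uses spaCy POS tags instead of a fixed token list.
DANGLING = {"le", "la", "les", "un", "une", "de", "du", "des", "l'", "d'", "qu'"}

def rebalance(lines: list[str]) -> list[str]:
    out = [line.split() for line in lines]
    for i in range(len(out) - 1):
        # keep moving dangling line-final tokens down to the next line
        while out[i] and out[i][-1].lower() in DANGLING:
            out[i + 1].insert(0, out[i].pop())
    return [" ".join(tokens) for tokens in out]

print(rebalance(["Elle parle de la", "musique classique"]))
# → ['Elle parle', 'de la musique classique']
```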

Quick Start

Prerequisites

# macOS
brew install uv ffmpeg mpv
brew install pango           # required for PDF export (WeasyPrint)
brew install --cask ollama   # optional

# Ubuntu/Debian
sudo apt install ffmpeg mpv libpango-1.0-0 libpangoft2-1.0-0
curl -fsSL https://astral.sh/uv/install.sh | sh
curl -fsSL https://ollama.com/install.sh | sh   # optional

macOS PDF export: If PDF export fails with cannot load library 'libgobject-2.0-0', add this to your ~/.zshrc:

export DYLD_FALLBACK_LIBRARY_PATH=/opt/homebrew/lib

This lets the uv-managed Python find Homebrew's native libraries.

Installation

git clone https://github.com/RizhongLin/PolyglotWhisperer.git
cd PolyglotWhisperer
uv sync --all-extras

# Pull a local LLM for translation (optional)
ollama pull qwen3:8b

spaCy language models are downloaded automatically on first use.

Install only what you need
uv sync --extra transcribe    # Local Whisper (stable-ts, MLX)
uv sync --extra download      # URL downloading (yt-dlp)
uv sync --extra llm           # LLM translation (LiteLLM, Ollama)
uv sync --extra nlp           # spaCy NLP (POS tagging, lemmatizer)
uv sync --extra vocab         # Vocabulary analysis (wordfreq + spaCy)
uv sync --extra export        # PDF/EPUB export (WeasyPrint, ebooklib)

API Keys (for cloud providers)

cp .env.example .env   # edit and add your keys

LiteLLM routes to any provider via model prefix — set the matching API key in .env. See .env.example for supported providers.
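For example, a minimal .env might look like the following. The key names follow LiteLLM's standard environment variables; which ones you actually need depends on the model prefixes you use, and .env.example remains the authoritative list.

```shell
# .env — set only the keys for providers you use.
# LiteLLM picks the provider from the model prefix,
# e.g. "groq/whisper-large-v3-turbo" → GROQ_API_KEY.
GROQ_API_KEY=gsk_...
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
OPENROUTER_API_KEY=sk-or-...
```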

Usage

# Full pipeline: download → transcribe → translate → play
pgw run "https://example.com/video" --translate en --no-play

# Refine transcription with LLM before translating
pgw run "https://example.com/video" --refine --translate en --no-play

# Cloud API transcription + translation (no local GPU needed)
pgw run "https://example.com/video" --backend api --llm-backend api --translate en --no-play

# Reuse existing subtitles from video page (skip Whisper if available)
pgw run "https://example.com/video" --subs --translate en --no-play

# Batch processing
pgw run *.mp4 --translate en --no-play
pgw run urls.txt --backend api --translate en --no-play

# Transcribe only
pgw transcribe video.mp4 -l fr
pgw transcribe *.mp4 --backend api -l fr

# Translate existing subtitles
pgw translate subtitles.fr.vtt --to en

# Vocabulary analysis
pgw vocab pgw_workspace/my-video/20260217_164802/

# Playback
pgw play pgw_workspace/my-video/20260217_164802/
pgw serve pgw_workspace/my-video/20260217_164802/   # web player

Configuration

Config layers (lowest to highest priority): packaged defaults → ~/.config/pgw/config.toml → ./pgw.toml → .env + env vars → CLI flags.

# pgw.toml
[whisper]
backend = "api"                       # "local" or "api"
api_model = "groq/whisper-large-v3-turbo"  # provider/model via LiteLLM
language = "fr"

[llm]
backend = "api"                       # "local" or "api"
local_model = "ollama_chat/qwen3:8b"  # Ollama for local backend
api_model = "openrouter/openai/gpt-oss-120b"  # any LiteLLM provider/model
target_language = "en"

Environment variables use PGW_ prefix: PGW_WHISPER__BACKEND=api, PGW_LLM__BACKEND=api, PGW_LLM__API_MODEL=<provider/model>.
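As a sketch of how the precedence layers interact (assuming a pgw.toml that sets backend = "local"):

```shell
# pgw.toml says backend = "local"; an env var overrides it for this run:
PGW_WHISPER__BACKEND=api pgw transcribe audio.wav -l fr

# ...and a CLI flag outranks the env var:
PGW_WHISPER__BACKEND=api pgw transcribe audio.wav --backend local -l fr
```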

Workspace Output

pgw_workspace/
├── .cache/                           # Shared cache (cross-workspace)
│   ├── audio/                        # Extracted audio
│   ├── compressed/                   # API-compressed MP3s
│   ├── transcriptions/               # Whisper results (local + API)
│   └── downloads/                    # yt-dlp downloads + subtitles
└── my-video/
    └── 20260217_164802/
        ├── video.mp4                 # Symlinked from source
        ├── audio.wav                 # Symlinked from cache
        ├── transcription.fr.vtt      # Original subtitles (from Whisper or downloaded)
        ├── transcription.fr.txt      # Plain text
        ├── translation.en.vtt        # Translated subtitles
        ├── translation.en.txt        # Translation plain text
        ├── bilingual.fr-en.vtt       # Dual-language VTT
        ├── parallel.fr-en.pdf        # Side-by-side PDF
        ├── parallel.fr-en.epub       # Side-by-side EPUB
        ├── vocabulary.fr.json        # CEFR analysis + rare words
        └── metadata.json

Transcription Backends

| Backend | Technology | Pros | Limits |
|---|---|---|---|
| Local (default) | stable-ts | Best quality, word-level timestamps, custom regrouping | Requires GPU / model downloads |
| Cloud API | LiteLLM | Fast, cheap, no GPU, auto-compresses large files | API key required |

# Local
pgw transcribe audio.wav -l fr                              # large-v3-turbo on MLX
pgw transcribe audio.wav -l fr --whisper-model medium        # smaller model

# Cloud API (any LiteLLM-supported provider)
pgw transcribe audio.wav --backend api -l fr
pgw transcribe audio.wav --backend api --whisper-model openai/whisper-1 -l fr

Vocabulary Analysis

Each processed video gets a vocabulary profile: CEFR level estimation via wordfreq, the top 30 rare words with context and translations, and spaCy lemmatization to group inflected forms.

pgw vocab pgw_workspace/my-video/20260217_164802/ --top 50
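The difficulty estimation can be pictured as a mapping from word frequency to a CEFR band. wordfreq's zipf_frequency(word, lang) returns a Zipf score (roughly 0–8, higher = more common); the band cut-offs below are illustrative guesses for the sketch, not the tool's actual thresholds.

```python
# Illustrative CEFR banding from a Zipf frequency score.
# Thresholds are assumptions for this sketch, not pgw's real cut-offs.
def cefr_band(zipf: float) -> str:
    thresholds = [(5.0, "A1"), (4.3, "A2"), (3.7, "B1"),
                  (3.1, "B2"), (2.5, "C1")]
    for cutoff, band in thresholds:
        if zipf >= cutoff:
            return band
    return "C2"  # very rare word

print(cefr_band(5.9))  # very common word → 'A1'
print(cefr_band(2.1))  # rare word → 'C2'
```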

How It Works

Video/Audio/URL
  → Download (yt-dlp, cached) + fetch existing subtitles
  → Extract Audio (ffmpeg, cached)
  → Use downloaded subtitles OR Transcribe (Whisper + spaCy segmentation)
  → Refine transcription (LLM, optional — fixes ASR errors, punctuation)
  → Translate (LLM, optional — sentence-boundary chunking with overlap)
  → Export (VTT/TXT/bilingual VTT/PDF/EPUB) + Vocabulary Analysis
  → Play (mpv or web player)
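The translation step's "sentence-boundary chunking with overlap" can be sketched as follows. This is a minimal stdlib illustration of the idea (split on sentence ends, carry trailing sentences into the next chunk for context); the sizes and the regex are assumptions, not the pipeline's real settings.

```python
import re

# Sketch: group sentences into chunks under a size budget, repeating the
# last `overlap` sentences at the start of the next chunk so the LLM keeps
# context across chunk boundaries.
def chunk_sentences(text: str, max_chars: int = 200, overlap: int = 1) -> list[list[str]]:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], []
    for sent in sentences:
        if current and sum(len(s) for s in current) + len(sent) > max_chars:
            chunks.append(current)
            current = current[-overlap:]  # overlap for context
        current.append(sent)
    if current:
        chunks.append(current)
    return chunks

print(chunk_sentences("A one. B two. C three. D four.", max_chars=15))
```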

Tech Stack

| Component | Technology |
|---|---|
| Transcription | stable-ts (MLX/CUDA/CPU) |
| Cloud APIs | LiteLLM (Groq, OpenAI, Ollama, Claude) |
| NLP | spaCy (26 language codes) + wordfreq |
| Export | WeasyPrint (PDF) + ebooklib (EPUB) |
| Subtitles | pysubs2 |
| Download | yt-dlp |
| Playback | mpv |
| CLI | Typer + Rich |

Supported Languages

Whisper supports 100 languages — run pgw languages for the full list. spaCy POS tagging and clitic handling cover 26 language codes (including Norwegian no/nn aliases).

Common language codes
| Code | Language | Code | Language | Code | Language |
|---|---|---|---|---|---|
| fr | French | zh | Chinese | pl | Polish |
| en | English | ja | Japanese | sv | Swedish |
| de | German | ko | Korean | da | Danish |
| es | Spanish | ar | Arabic | fi | Finnish |
| it | Italian | ru | Russian | uk | Ukrainian |
| pt | Portuguese | hi | Hindi | vi | Vietnamese |
| nl | Dutch | tr | Turkish | | |

Roadmap

  • Whisper transcription (local + cloud API) with word-level timestamps
  • LLM translation + dual subtitle playback
  • spaCy subtitle segmentation + Romance clitic handling (26 language codes)
  • Audio cache, batch processing, vocabulary analysis, parallel text export
  • Streaming pipeline event system
  • Subtitle download from video pages, web player, content-addressable cache
  • Hosted demo (Gradio on Hugging Face Spaces)
  • Speaker diarization
  • Anki card generation from subtitle pairs

License

MIT
