- Batch transcribe multiple files in a directory and optionally all sub-directories.
- Optional timestamps with configurable segment intervals
- Works on GPU (CUDA) and CPU, Windows or Linux
- Supported file types: AAC, AMR, ASF, AVI, FLAC, M4A, MKV, MP3, MP4, WAV, WEBM, WMA
- Link to article on Medium
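The batch-discovery step above can be sketched as follows. This is only an illustration of the idea, not the app's actual code; the extension set mirrors the supported file types listed above:

```python
from pathlib import Path

# Extensions matching the supported file types listed above.
SUPPORTED = {".aac", ".amr", ".asf", ".avi", ".flac", ".m4a",
             ".mkv", ".mp3", ".mp4", ".wav", ".webm", ".wma"}

def find_audio_files(root: str, recursive: bool = True) -> list[Path]:
    """Collect supported media files under `root`, optionally recursing."""
    base = Path(root)
    candidates = base.rglob("*") if recursive else base.glob("*")
    return sorted(p for p in candidates if p.suffix.lower() in SUPPORTED)
```

Matching on the lowercased suffix makes the scan case-insensitive, so `clip.WAV` is picked up the same as `clip.wav`.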
Download and run `Elegant_Transcriber_Setup.exe` (right-click and run as administrator).
Download the latest release, unzip it, then navigate to the directory containing `main.py` and run (Windows):

```
python -m venv .
.\Scripts\activate
python install.py
python main.py
```
Download the latest release, unzip it, then navigate to the directory containing `main.py` and run (Linux):

```
python3 -m venv .
source bin/activate
python install.py
python main.py
```
- ~2.5 hour audio file: `sam_altman_lex_podcast_367.flac`
| Library | Model | Batch | Chunk | VRAM Usage | Time | Real Time | Quality Ranking |
|---|---|---|---|---|---|---|---|
| Elegant Transcriber (NeMo) | Parakeet TDT 0.6B v2 | 1 | 90s | ~3.3 GB | 14.9s | 580x | #8 |
| Transformers | Whisper Large v3 | 32 | Default | ~12.4 GB | 52.2s | 166x | #32 |
| WhisperS2T Reborn (Ctranslate2) | Whisper Large v3 | 32 | Default | ~13.4 GB | 66.9s | 129x | #32 |
| Faster-Whisper (Ctranslate2) | Whisper Large v3 | 32 | Default | ~12.5 GB | 75.9s | 114x | #32 |
| WhisperX (Ctranslate2) | Whisper Large v3 | 32 | Default | ~12.8 GB | 71.8s | 120x | #32 |
| Transformers | Granite 4.0 1B Speech | 12 | 30s | ~6.3 GB | 97.7s | 88x | #1 |
| Elegant Transcriber (NeMo) | Canary-Qwen-2.5b | 1 | 40s | ~11.1 GB | 639.8s | 13.5x | #2 |
All models were run in `bfloat16`.
All VRAM measurements include model weights and inference overhead and subtract background usage.
All parameters were chosen to maximize throughput, targeting ~90% CUDA core utilization on an RTX 4090.
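The "Real Time" column is simply the ratio of audio length to wall-clock transcription time. A quick sketch (the ~8,665 s duration is a back-of-envelope figure inferred from the table rows, not a measured value):

```python
def real_time_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio transcribed per second of compute."""
    return audio_seconds / wall_seconds

# The ~2.5 h FLAC is roughly 8,665 s (inferred, not measured):
print(round(real_time_factor(8665, 52.2)))  # Whisper Large v3 row -> ~166x
```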
- ~13 minute private audio file.
- CPU tests use a shorter audio sample to keep runtimes manageable.
| Library | Model | Batch | Chunk | RAM Usage | Time | Real Time | Quality Ranking |
|---|---|---|---|---|---|---|---|
| Elegant Transcriber | Parakeet TDT 0.6B v2 | 1 | 90s | ~5.6 GB | 29.0s | 26.8x | #8 |
| Faster-Whisper (Ctranslate2) | Whisper Large v3 | 1 | Default | ~6.5 GB | 211.8s | 3.67x | #32 |
| WhisperS2T Reborn (Ctranslate2) | Whisper Large v3 | 1 | Default | ~6.6 GB | 257.9s | 3.02x | #32 |
| Transformers | Whisper Large v3 | 1 | Default | ~6.6 GB | 311.1s | 2.50x | #32 |
| Elegant Transcriber (NeMo) | Canary-Qwen-2.5b | 1 | 40s | ~11.1 GB | 370.1s | 2.1x | #2 |
| WhisperX (Ctranslate2) | Whisper Large v3 | 1 | Default | ~7.3 GB | 396.4s | 1.96x | #32 |
All models were loaded in `float32` for CPU compatibility.
20 threads were used on an Intel Core i9-13900K, resulting in ~90% CPU usage.
I couldn't get Granite Speech to run...
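Both tables list a "Chunk" setting: long recordings are split into fixed-length windows before inference. A minimal sketch of the boundary math, purely illustrative (the real pipeline may overlap chunks or split on silence):

```python
def chunk_spans(total_seconds: float, chunk_seconds: float) -> list[tuple[float, float]]:
    """Return (start, end) windows covering the file in fixed-length chunks."""
    spans = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        spans.append((start, end))
        start = end
    return spans

# A 13-minute (780 s) file with the 90 s Parakeet setting:
print(len(chunk_spans(780, 90)))  # -> 9 chunks; the last one is 60 s long
```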
- Nvidia for the Parakeet models, which are hands down the best balance of accuracy and compute time for most people, IMHO.
- IBM for the Granite Speech models, which, as of March 2026, rank #1 on the ASR leaderboard in terms of accuracy. I'll include them in a later release.
- OpenAI for the older Whisper models setting the gold standard for so many years.
