CLI utilities for inspecting and cleaning a TTS corpus.
- Requires Python 3.9+.
- Install in editable mode (ensures entry point is available):
pip install -e .
- The tools use soundfile, mutagen, and pandas. If pandas is missing, add it: pip install pandas.
- Show help:
tts-corpus-workbench --help
- All commands accept comma-separated audio extensions (default: .wav,.mp3,.flac,.m4a).
- Count total duration in a folder:
tts-corpus-workbench compute-audio-hours --folder data/audio --extensions ".wav,.flac"
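
For readers curious what a duration count like this involves, here is a minimal sketch. It is not the tool's actual implementation: it uses only the standard library's wave module, so it reads PCM .wav files only, whereas the tool presumably relies on soundfile and mutagen for the other formats. The function names (parse_extensions, wav_duration_seconds, total_hours) and the recursive folder scan are assumptions made for illustration.

```python
import wave
from pathlib import Path


def parse_extensions(raw: str) -> set[str]:
    """Turn a string like ".wav,.flac" into {".wav", ".flac"} (lowercased, dot-prefixed)."""
    exts = set()
    for part in raw.split(","):
        part = part.strip().lower()
        if part:
            exts.add(part if part.startswith(".") else "." + part)
    return exts


def wav_duration_seconds(path: Path) -> float:
    """Duration of a PCM WAV file: frame count divided by sample rate."""
    with wave.open(str(path), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def total_hours(folder: Path, extensions: str = ".wav") -> float:
    """Sum the durations of matching files under `folder`, in hours.

    Assumption: the scan is recursive. Files that the wave module cannot
    read (for example .flac) are skipped in this sketch.
    """
    wanted = parse_extensions(extensions)
    seconds = 0.0
    for path in sorted(folder.rglob("*")):
        if path.is_file() and path.suffix.lower() in wanted:
            try:
                seconds += wav_duration_seconds(path)
            except wave.Error:
                continue  # not a WAV file this sketch can decode
    return seconds / 3600.0
```

Calling total_hours(Path("data/audio"), ".wav,.flac") would then give a rough figure for the .wav portion of the folder.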
- Compare metadata CSV with an audio folder:
tts-corpus-workbench find-orphan-audio --metadata metadata.csv --audio-col file --folder data/audio
- Prints counts for matches/missing/orphans; add --delete-orphan to remove files on disk that are not in metadata.
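
The matches/missing/orphans split is essentially set arithmetic between the file names listed in the metadata and the files found on disk. The sketch below illustrates that idea with the standard library only; it is not the tool's code. The function name compare_metadata_to_folder, matching by base name, and the non-recursive folder listing are all assumptions. It only reports and never deletes anything.

```python
import csv
from pathlib import Path

DEFAULT_EXTENSIONS = (".wav", ".mp3", ".flac", ".m4a")


def compare_metadata_to_folder(metadata_csv, audio_col, folder,
                               extensions=DEFAULT_EXTENSIONS):
    """Return (matches, missing, orphans) as sorted lists of file names.

    matches: listed in the metadata and present on disk
    missing: listed in the metadata but absent from disk
    orphans: present on disk but not listed in the metadata
    Assumption: files are matched by base name only.
    """
    with open(metadata_csv, newline="", encoding="utf-8") as fh:
        listed = {
            Path(row[audio_col]).name
            for row in csv.DictReader(fh)
            if row.get(audio_col)
        }
    on_disk = {
        p.name
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in extensions
    }
    return (
        sorted(listed & on_disk),
        sorted(listed - on_disk),
        sorted(on_disk - listed),
    )
```

Reviewing the orphans list from a report like this before running with the delete flag is a cheap safeguard, since removal from disk cannot be undone.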
- Export acronym frequency stats from a text column:
tts-corpus-workbench detect-acronyms --metadata metadata.csv --text-col text --output acronym_stats.csv
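
To give a sense of what acronym frequency stats might look like, here is a small sketch using only the standard library. How the tool actually defines an acronym is not documented here; the rule below (a standalone run of two or more ASCII capital letters) is an assumption, as are the names ACRONYM_RE, count_acronyms, and write_stats and the two-column output layout.

```python
import csv
import re
from collections import Counter

# Assumed rule: an acronym is a standalone run of two or more ASCII capitals.
ACRONYM_RE = re.compile(r"\b[A-Z]{2,}\b")


def count_acronyms(texts):
    """Count acronym occurrences across an iterable of transcript strings."""
    counts = Counter()
    for text in texts:
        counts.update(ACRONYM_RE.findall(text))
    return counts


def write_stats(counts, output_path):
    """Write acronym,count rows to a CSV file, most frequent first."""
    with open(output_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["acronym", "count"])
        writer.writerows(counts.most_common())


counts = count_acronyms(["NASA and the FBI met NASA.", "I am here"])
print(counts.most_common())  # [('NASA', 2), ('FBI', 1)]
```

Stats like these help decide which acronyms need spelled-out or normalized forms before the transcripts are used for TTS training.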