Phonebooth – Centralised STT/TTS Service

Phonebooth wraps two heavy-weight speech models – Whisper for speech-to-text (STT) and XTTS-v2 for multilingual text-to-speech (TTS) – in a single Python module and a small FastAPI application.

Importing the module automatically initialises both models once per Python process so that every part of your codebase can share GPU memory. If you prefer network isolation or a polyglot environment, the same object is also exposed through an HTTP micro-service.

Features

📝 Transcription – whisper-large-v3 running through faster-whisper for blazingly fast inference.
🌐 Language Support – supports multiple languages for both transcription and synthesis.
🔊 Synthesis – XTTS-v2 multilingual TTS with 100+ speakers.
🔄 Single initialisation – heavy models are loaded once and re-used everywhere.
⚡ Async helpers – await transcribe_audio() & await synthesize().
🛰️ Micro-service – optional FastAPI server with streaming responses.
📦 Docker image – builds models at image-creation time for near-instant container start-up.

Quick start

1. Install (local GPU or CPU)

# Clone this repository or copy phonebooth/ into your project
pip install -r phonebooth/requirements.txt

PyTorch wheels are not pinned in requirements.txt – install the flavour that matches your hardware & CUDA version before the step above, e.g.

pip install torch==2.7.1+cu121 -f https://download.pytorch.org/whl/torch_stable.html

2. Use as a Python library

import phonebooth as pb

wav_bytes = Path("example.wav").read_bytes()
text = await pb.transcribe_audio(wav_bytes)
print(text)

wav_np = await pb.synthesize("Bonjour le monde!", speaker="1")
# save with torchaudio, sounddevice, etc.

3. Run as a micro-service

uvicorn phonebooth:app --host 0.0.0.0 --port 8000

Swagger UI will be available on the root path / by default. Set ENABLE_DOCS=false to disable it.

Proxy subroute support: When serving behind nginx or another reverse proxy at a subroute (e.g., https://example.com/api/phonebooth), set the ROOT_PATH environment variable to the subroute path (e.g., /api/phonebooth). This ensures URLs in OpenAPI docs and responses are generated correctly.

POST /transcribe   → { "text": "…" }
POST /tts          → audio/wav (full file)
POST /tts_stream   → application/octet-stream (chunked WAV)
GET  /speakers      → { "0": "Speaker A", "1": "Speaker B", … }

4. Docker (GPU)

A ready-to-run Dockerfile is provided:

# Build (models are downloaded & initialised during build)
docker build -t phonebooth .

# Run with NVIDIA runtime
docker run --gpus all -p 8000:8000 \
           -e NVIDIA_VISIBLE_DEVICES=0 \  # optional, limit GPU
           -e COQUI_TOS_AGREED=1 \         # agree to XTTS terms
           -e API_KEY=your-secret-key \    # optional, enable API authentication
           phonebooth

Container start-up is now instant because the heavy models were already loaded at build time (RUN python phonebooth.py).

Docker Compose: A compose.yml file is included for easy deployment. Copy env.example to .env and update the values:

cp env.example .env
# Edit .env with your preferred values

The compose file will automatically load variables from .env (the file is gitignored for security). See env.example for all available configuration options.

API reference

Async helpers (Python)

await transcribe_audio(raw_audio: bytes) -> str
await synthesize(text: str, speaker: str | None = None) -> numpy.ndarray

raw_audio must be 16-bit PCM WAV bytes (any sampling rate, mono/stereo).
speaker may be either a numeric index as string or an exact speaker name.
The language for TTS is auto-detected via py3langid (configurable mapping inside the module).

HTTP endpoints (FastAPI)

Method	Path	Body	Response
POST	`/transcribe`	`multipart/form-data` file=`wav`	`{ "text": "…" }`
POST	`/tts`	JSON: `{text, speaker}`	`audio/wav`
POST	`/tts_stream`	JSON: `{text, speaker, chunk_seconds}`	binary stream (2 s chunks by default)
GET	`/speakers`	–	`{ "id": "name", … }`
GET	`/demo`	–	HTML demo page (requires `ENABLE_DEMO=true`)

Authentication: When the API_KEY environment variable is set, all API endpoints (except /demo) require Bearer token authentication. Include the header Authorization: Bearer <your-api-key> in requests. The /demo page is publicly accessible (when enabled) and includes an optional API key input field for authenticated API calls.

Streaming TTS chunks let you start playback while the rest of the sentence is still being synthesised – ideal for real-time chat assistants.

Configuration

Variable	Purpose	Default
`API_KEY`	API key for Bearer token authentication. When set, all endpoints (except `/demo`) require `Authorization: Bearer <key>` header	–
`ENABLE_DEMO`	Enable the `/demo` endpoint. Set to `true`, `1`, or `yes` to enable the demo page	`false`
`ENABLE_DOCS`	Enable OpenAPI documentation (Swagger UI). Set to `false`, `0`, or `no` to disable	`true`
`ROOT_PATH`	Root path when served behind a reverse proxy at a subroute (e.g., `/api/phonebooth`). Leave empty for root deployment	–
`COQUI_TOS_AGREED`	Must be set to `1` to silence the XTTS licence prompt	–
`GLOG_minloglevel`	Reduce TensorFlow/Whisper spam (`export GLOG_minloglevel=2`)	–

Edit phonebooth.py if you want to:

change WHISPER_MODEL_SIZE (tiny, small, medium, large-v3, …)
pin a different compute type (int8_float16, float16, …)
replace the XTTS speaker list mapping

Troubleshooting

CUDA out of memory – lower the compute type (e.g. float16 → int8_float16) or run on CPU.
Libtorch/ONNX warnings – harmless; silence them with TORCH_WARNINGS=ignore.
Slow startup – first run downloads ~3 GB of model weights; subsequent runs are instant.

Licence

This repository only contains glue code; it relies on third-party models licensed under their respective terms. See the original projects for details. All original code in phonebooth is released under the MIT licence © 2024.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
compose.yml		compose.yml
demo.html		demo.html
env.example		env.example
phonebooth.py		phonebooth.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Phonebooth – Centralised STT/TTS Service

Features

Quick start

1. Install (local GPU or CPU)

2. Use as a Python library

3. Run as a micro-service

4. Docker (GPU)

API reference

Async helpers (Python)

HTTP endpoints (FastAPI)

Configuration

Troubleshooting

Licence

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Phonebooth – Centralised STT/TTS Service

Features

Quick start

1. Install (local GPU or CPU)

2. Use as a Python library

3. Run as a micro-service

4. Docker (GPU)

API reference

Async helpers (Python)

HTTP endpoints (FastAPI)

Configuration

Troubleshooting

Licence

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages