90 changes: 71 additions & 19 deletions CLAUDE.md
@@ -4,43 +4,95 @@ This file provides guidance to Claude Code when working with this repository.

## Project Overview

TTS Server is a multi-model text-to-speech API with voice cloning support. It's designed to be backend-agnostic, allowing different TTS models to be plugged in.
Studio Server is an AI-powered studio utilities API for video production. It provides modular backends for:

- **TTS**: Text-to-speech with voice cloning (Qwen3-TTS)
- **Face**: Face embedding extraction for IP-Adapter FaceID (InsightFace)
- **Transcription**: Audio transcription with word-level timestamps (Whisper)

## Architecture

```
server.py
├── TTSBackend (abstract base class)
│ └── Qwen3TTSBackend (implementation)
├── BACKENDS registry
└── FastAPI application
studio-server/
├── server.py # FastAPI application with all endpoints
├── backends/
│ ├── __init__.py # Backend exports
│ ├── base.py # Abstract Backend base class
│ ├── tts.py # TTSBackend + Qwen3TTSBackend
│ ├── face.py # FaceBackend + InsightFaceBackend
│ └── transcription.py # TranscriptionBackend + WhisperBackend
└── tests/
```
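The abstract base class in `backends/base.py` is not shown in this diff; the sketch below is a guess at its shape (the `Backend` name comes from the tree above, but the method signatures and the `MockTTSBackend` subclass are assumptions for illustration):

```python
from abc import ABC, abstractmethod


class Backend(ABC):
    """Hypothetical sketch of the abstract base in backends/base.py."""

    @abstractmethod
    def load(self) -> None:
        """Load model weights (CUDA if available, CPU fallback)."""

    @abstractmethod
    def get_info(self) -> dict:
        """Return backend metadata for /v1/models."""


class MockTTSBackend(Backend):
    # Illustrative only; the real implementations live in backends/tts.py etc.
    def load(self) -> None:
        self.loaded = True

    def get_info(self) -> dict:
        return {"name": "mock", "capability": "tts"}
```

Because `Backend` is an ABC, attempting to instantiate it directly raises `TypeError`, which keeps incomplete backends from being registered by accident.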

## Key Design Decisions

1. **Backend Abstraction**: All TTS models implement `TTSBackend` interface
2. **ref_text Support**: Voice cloning accepts both audio and transcript for quality
3. **Stateless**: No voice profile storage - consuming app manages assets
4. **GPU First**: Designed for CUDA, falls back to CPU
1. **Modular Backends**: Each capability (TTS, Face, Transcription) has its own backend abstraction
2. **Optional Loading**: Face and Transcription backends can be disabled via environment variables
3. **Legacy Compatibility**: Old TTS endpoints remain for backwards compatibility
4. **Stateless**: No asset storage - consuming app manages files
5. **GPU First**: Designed for CUDA, falls back to CPU
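Decision 2 (optional loading) can be sketched as a small helper in `server.py`. The function name, the registry shape, and the exact env-var parsing are assumptions; only the variable names and defaults come from the table below in this file:

```python
import os


def load_optional_backends(registry: dict) -> dict:
    """Hypothetical sketch: honor FACE_ENABLED / TRANSCRIPTION_ENABLED.

    `registry` maps backend name -> factory callable. Returns only the
    backends the environment enables, so a GPU-less dev box can skip
    heavy model loads entirely.
    """
    backends = {}
    if os.environ.get("FACE_ENABLED", "true").lower() == "true":
        name = os.environ.get("FACE_BACKEND", "insightface")
        backends["face"] = registry[name]()
    if os.environ.get("TRANSCRIPTION_ENABLED", "true").lower() == "true":
        name = os.environ.get("TRANSCRIPTION_BACKEND", "whisper")
        backends["transcription"] = registry[name]()
    return backends
```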

## API Structure

**TTS Endpoints:**
- `GET /v1/tts/speakers` - List available speakers
- `POST /v1/tts/extract` - Extract voice prompt from audio
- `POST /v1/tts/synthesize` - Synthesize speech

**Face Endpoints:**
- `POST /v1/face/embed` - Extract face embedding from image
- `POST /v1/face/embed-all` - Extract all faces from image
- `POST /v1/face/compare` - Compare two embeddings

**Transcription Endpoints:**
- `POST /v1/transcribe` - Transcribe audio with word timings

**Health/Info:**
- `GET /health` - Health check with backend status
- `GET /v1/models` - List available backends
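A minimal client sketch for the synthesis endpoint, using only the stdlib. The base URL assumes the Dockerfile's `EXPOSE 8000`; the `text`/`speaker` field names are guesses, since the request schema is not shown in this diff:

```python
import json

BASE = "http://localhost:8000"  # assumed from EXPOSE 8000 in the Dockerfile


def synthesize_request(text: str, speaker: str) -> tuple:
    """Build the URL and JSON body for POST /v1/tts/synthesize.

    Field names are assumptions, not confirmed by this diff.
    """
    body = json.dumps({"text": text, "speaker": speaker}).encode("utf-8")
    return f"{BASE}/v1/tts/synthesize", body


# To actually send it (requires a running server):
# import urllib.request
# url, body = synthesize_request("Hello", "default")
# req = urllib.request.Request(url, data=body,
#                              headers={"Content-Type": "application/json"})
# audio = urllib.request.urlopen(req).read()
```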

## Environment Variables

## API
| Variable | Default | Description |
|----------|---------|-------------|
| `TTS_BACKEND` | `qwen3-tts` | TTS backend (`qwen3-tts`, `mock`) |
| `FACE_ENABLED` | `true` | Load face backend |
| `FACE_BACKEND` | `insightface` | Face backend |
| `TRANSCRIPTION_ENABLED` | `true` | Load transcription backend |
| `TRANSCRIPTION_BACKEND` | `whisper` | Transcription backend |
| `TTS_MODEL` | `Qwen/Qwen3-TTS-12Hz-1.7B-Base` | TTS model checkpoint |

- `POST /v1/audio/speech` - Main synthesis endpoint
- `GET /health` - Health check
- `GET /v1/models` - List backends
## Development Mode

For local development without a GPU, use the mock TTS backend and disable the other backends:

```bash
TTS_BACKEND=mock FACE_ENABLED=false TRANSCRIPTION_ENABLED=false python server.py
```

## Adding New Backends

1. Extend `TTSBackend` class
2. Implement `load()`, `synthesize()`, `get_info()`
3. Add to `BACKENDS` dict
1. Create or extend the appropriate backend class in `backends/`
2. Implement `load()`, `get_info()`, and capability-specific methods
3. Add to the `*_BACKENDS` registry dict
4. Update environment variable handling in `server.py` if needed
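The steps above can be sketched end to end. The `TTSBackend` interface and `TTS_BACKENDS` registry names follow this diff, but the stand-in base class, the `EchoTTSBackend` example, and its method signatures are illustrative assumptions:

```python
class TTSBackend:
    """Stand-in for the abstract class in backends/tts.py (step 1)."""

    def load(self):
        raise NotImplementedError

    def get_info(self):
        raise NotImplementedError

    def synthesize(self, text, speaker):
        raise NotImplementedError


class EchoTTSBackend(TTSBackend):
    """Toy backend that 'synthesizes' by echoing the text as bytes (step 2)."""

    def load(self):
        self.ready = True

    def get_info(self):
        return {"name": "echo", "capability": "tts"}

    def synthesize(self, text, speaker):
        return text.encode("utf-8")


# Step 3: register it so e.g. TTS_BACKEND=echo could select it at startup.
TTS_BACKENDS = {"echo": EchoTTSBackend}
```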

## Deployment

Deployed to happyvertical k8s cluster via Flux GitOps.
Manifests in: `happyvertical/iac/manifests/applications/tts-server/`
Manifests in: `happyvertical/iac/manifests/applications/studio-server/`

## Related Packages

- `@happyvertical/ai` - SDK package that may consume this service
- `@happyvertical/histrio` - Video production agent that consumes this service
- `@happyvertical/ai` - SDK package with TTS client
- SMRT voice packages - May integrate via TypeScript client

## Testing

```bash
# Run tests
pytest

# Run with coverage
pytest --cov=backends --cov=server
```
17 changes: 12 additions & 5 deletions Dockerfile
@@ -1,28 +1,35 @@
# TTS Server - Multi-model text-to-speech API
# Supports: Qwen3-TTS with voice cloning
# Studio Server - AI-powered studio utilities for video production
# Supports: TTS (Qwen3-TTS), Face Embedding (InsightFace), Transcription (Whisper)

FROM pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime

WORKDIR /app

# Install system dependencies
# Install system dependencies (including build tools for insightface)
RUN apt-get update && apt-get install -y --no-install-recommends \
libsndfile1 \
ffmpeg \
sox \
libsox-dev \
libgl1-mesa-glx \
libglib2.0-0 \
build-essential \
&& rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application
COPY backends/ ./backends/
COPY server.py .

# Environment variables
# Environment variables - defaults
ENV TTS_BACKEND=qwen3-tts
ENV TTS_MODEL=Qwen/Qwen3-TTS-12Hz-1.7B-Base
ENV FACE_ENABLED=true
ENV FACE_BACKEND=insightface
ENV TRANSCRIPTION_ENABLED=true
ENV TRANSCRIPTION_BACKEND=whisper

# Expose port
EXPOSE 8000