90 changes: 71 additions & 19 deletions CLAUDE.md
@@ -4,43 +4,95 @@ This file provides guidance to Claude Code when working with this repository.

## Project Overview

TTS Server is a multi-model text-to-speech API with voice cloning support. It's designed to be backend-agnostic, allowing different TTS models to be plugged in.
Studio Server is an AI-powered studio utilities API for video production. It provides modular backends for:

- **TTS**: Text-to-speech with voice cloning (Qwen3-TTS)
- **Face**: Face embedding extraction for IP-Adapter FaceID (InsightFace)
- **Transcription**: Audio transcription with word-level timestamps (Whisper)

## Architecture

```
server.py
├── TTSBackend (abstract base class)
│ └── Qwen3TTSBackend (implementation)
├── BACKENDS registry
└── FastAPI application
studio-server/
├── server.py # FastAPI application with all endpoints
├── backends/
│ ├── __init__.py # Backend exports
│ ├── base.py # Abstract Backend base class
│ ├── tts.py # TTSBackend + Qwen3TTSBackend
│ ├── face.py # FaceBackend + InsightFaceBackend
│ └── transcription.py # TranscriptionBackend + WhisperBackend
└── tests/
```
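The abstract base class in `backends/base.py` is not shown in this diff; the sketch below is a guess at its shape (the `Backend` name comes from the tree above, but the method signatures and the `MockTTSBackend` subclass are assumptions for illustration):

```python
from abc import ABC, abstractmethod


class Backend(ABC):
    """Hypothetical sketch of the abstract base in backends/base.py."""

    @abstractmethod
    def load(self) -> None:
        """Load model weights (CUDA if available, CPU fallback)."""

    @abstractmethod
    def get_info(self) -> dict:
        """Return backend metadata for /v1/models."""


class MockTTSBackend(Backend):
    # Illustrative only; the real implementations live in backends/tts.py etc.
    def load(self) -> None:
        self.loaded = True

    def get_info(self) -> dict:
        return {"name": "mock", "capability": "tts"}
```

Because `Backend` is an ABC, attempting to instantiate it directly raises `TypeError`, which keeps incomplete backends from being registered by accident.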

## Key Design Decisions

1. **Backend Abstraction**: All TTS models implement `TTSBackend` interface
2. **ref_text Support**: Voice cloning accepts both audio and transcript for quality
3. **Stateless**: No voice profile storage - consuming app manages assets
4. **GPU First**: Designed for CUDA, falls back to CPU
1. **Modular Backends**: Each capability (TTS, Face, Transcription) has its own backend abstraction
2. **Optional Loading**: Face and Transcription backends can be disabled via environment variables
3. **Legacy Compatibility**: Old TTS endpoints remain for backwards compatibility
4. **Stateless**: No asset storage - consuming app manages files
5. **GPU First**: Designed for CUDA, falls back to CPU
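Decision 2 (optional loading) can be sketched as a small helper in `server.py`. The function name, the registry shape, and the exact env-var parsing are assumptions; only the variable names and defaults come from the table below in this file:

```python
import os


def load_optional_backends(registry: dict) -> dict:
    """Hypothetical sketch: honor FACE_ENABLED / TRANSCRIPTION_ENABLED.

    `registry` maps backend name -> factory callable. Returns only the
    backends the environment enables, so a GPU-less dev box can skip
    heavy model loads entirely.
    """
    backends = {}
    if os.environ.get("FACE_ENABLED", "true").lower() == "true":
        name = os.environ.get("FACE_BACKEND", "insightface")
        backends["face"] = registry[name]()
    if os.environ.get("TRANSCRIPTION_ENABLED", "true").lower() == "true":
        name = os.environ.get("TRANSCRIPTION_BACKEND", "whisper")
        backends["transcription"] = registry[name]()
    return backends
```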

## API Structure

**TTS Endpoints:**
- `GET /v1/tts/speakers` - List available speakers
- `POST /v1/tts/extract` - Extract voice prompt from audio
- `POST /v1/tts/synthesize` - Synthesize speech

**Face Endpoints:**
- `POST /v1/face/embed` - Extract face embedding from image
- `POST /v1/face/embed-all` - Extract all faces from image
- `POST /v1/face/compare` - Compare two embeddings

**Transcription Endpoints:**
- `POST /v1/transcribe` - Transcribe audio with word timings

**Health/Info:**
- `GET /health` - Health check with backend status
- `GET /v1/models` - List available backends
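A minimal client sketch for the synthesis endpoint, using only the stdlib. The base URL assumes the Dockerfile's `EXPOSE 8000`; the `text`/`speaker` field names are guesses, since the request schema is not shown in this diff:

```python
import json

BASE = "http://localhost:8000"  # assumed from EXPOSE 8000 in the Dockerfile


def synthesize_request(text: str, speaker: str) -> tuple:
    """Build the URL and JSON body for POST /v1/tts/synthesize.

    Field names are assumptions, not confirmed by this diff.
    """
    body = json.dumps({"text": text, "speaker": speaker}).encode("utf-8")
    return f"{BASE}/v1/tts/synthesize", body


# To actually send it (requires a running server):
# import urllib.request
# url, body = synthesize_request("Hello", "default")
# req = urllib.request.Request(url, data=body,
#                              headers={"Content-Type": "application/json"})
# audio = urllib.request.urlopen(req).read()
```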

## Environment Variables

## API
| Variable | Default | Description |
|----------|---------|-------------|
| `TTS_BACKEND` | `qwen3-tts` | TTS backend (`qwen3-tts`, `mock`) |
| `FACE_ENABLED` | `true` | Load face backend |
| `FACE_BACKEND` | `insightface` | Face backend |
| `TRANSCRIPTION_ENABLED` | `true` | Load transcription backend |
| `TRANSCRIPTION_BACKEND` | `whisper` | Transcription backend |
| `TTS_MODEL` | `Qwen/Qwen3-TTS-12Hz-1.7B-Base` | TTS model checkpoint |

- `POST /v1/audio/speech` - Main synthesis endpoint
- `GET /health` - Health check
- `GET /v1/models` - List backends
## Development Mode

For local development without a GPU, use the mock TTS backend and disable the other backends:

```bash
TTS_BACKEND=mock FACE_ENABLED=false TRANSCRIPTION_ENABLED=false python server.py
```

## Adding New Backends

1. Extend `TTSBackend` class
2. Implement `load()`, `synthesize()`, `get_info()`
3. Add to `BACKENDS` dict
1. Create or extend the appropriate backend class in `backends/`
2. Implement `load()`, `get_info()`, and capability-specific methods
3. Add to the `*_BACKENDS` registry dict
4. Update environment variable handling in `server.py` if needed
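The steps above can be sketched end to end. The `TTSBackend` interface and `TTS_BACKENDS` registry names follow this diff, but the stand-in base class, the `EchoTTSBackend` example, and its method signatures are illustrative assumptions:

```python
class TTSBackend:
    """Stand-in for the abstract class in backends/tts.py (step 1)."""

    def load(self):
        raise NotImplementedError

    def get_info(self):
        raise NotImplementedError

    def synthesize(self, text, speaker):
        raise NotImplementedError


class EchoTTSBackend(TTSBackend):
    """Toy backend that 'synthesizes' by echoing the text as bytes (step 2)."""

    def load(self):
        self.ready = True

    def get_info(self):
        return {"name": "echo", "capability": "tts"}

    def synthesize(self, text, speaker):
        return text.encode("utf-8")


# Step 3: register it so e.g. TTS_BACKEND=echo could select it at startup.
TTS_BACKENDS = {"echo": EchoTTSBackend}
```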

## Deployment

Deployed to happyvertical k8s cluster via Flux GitOps.
Manifests in: `happyvertical/iac/manifests/applications/tts-server/`
Manifests in: `happyvertical/iac/manifests/applications/studio-server/`

## Related Packages

- `@happyvertical/ai` - SDK package that may consume this service
- `@happyvertical/histrio` - Video production agent that consumes this service
- `@happyvertical/ai` - SDK package with TTS client
- SMRT voice packages - May integrate via TypeScript client

## Testing

```bash
# Run tests
pytest

# Run with coverage
pytest --cov=backends --cov=server
```
17 changes: 12 additions & 5 deletions Dockerfile
@@ -1,28 +1,35 @@
# TTS Server - Multi-model text-to-speech API
# Supports: Qwen3-TTS with voice cloning
# Studio Server - AI-powered studio utilities for video production
# Supports: TTS (Qwen3-TTS), Face Embedding (InsightFace), Transcription (Whisper)

FROM pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime

WORKDIR /app

# Install system dependencies
# Install system dependencies (including build tools for insightface)
RUN apt-get update && apt-get install -y --no-install-recommends \
libsndfile1 \
ffmpeg \
sox \
libsox-dev \
libgl1-mesa-glx \
libglib2.0-0 \
build-essential \
&& rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application
COPY backends/ ./backends/
COPY server.py .

# Environment variables
# Environment variables - defaults
ENV TTS_BACKEND=qwen3-tts
ENV TTS_MODEL=Qwen/Qwen3-TTS-12Hz-1.7B-Base
ENV FACE_ENABLED=true
ENV FACE_BACKEND=insightface
ENV TRANSCRIPTION_ENABLED=true
ENV TRANSCRIPTION_BACKEND=whisper

# Expose port
EXPOSE 8000