Open-source real-time voice AI platform with modular ASR, LLM, and TTS pipelines.
Self-hosted alternative to OpenAI Realtime API — sub-450ms latency, 100+ concurrent sessions.
Quick Start • Playground • API Docs • Architecture • Contributing
OpenAI's Realtime API is powerful, but expensive ($0.06+/min), not self-hostable, and vendor-locked. We built this so you can run real-time voice AI on your own infrastructure, with any model you choose.
| | OpenAI Realtime | This Project |
|---|---|---|
| Deployment | Cloud only | Self-hosted |
| Data privacy | Third-party | Fully private |
| LLM | GPT-4o only | Any OpenAI-compatible |
| TTS | OpenAI only | Edge / MiniMax / Azure / VolcEngine |
| ASR | Whisper only | WhisperLive |
| Cost | Per-minute billing | Fixed server cost |
| Latency | ~500ms | ≤450ms |
```bash
curl -fsSL https://raw.githubusercontent.com/SquadyAI/RealtimeAPI/main/server/install.sh | bash
realtime onboard
```

The wizard walks you through LLM, TTS, and ASR configuration, creates `.env`, builds the binary, and starts the server. Open http://localhost:8080 to try the built-in Playground.
Prerequisites
- An OpenAI-compatible LLM API key (OpenAI / DeepSeek / Qwen / Ollama)
Install from source
```bash
git clone https://github.com/SquadyAI/RealtimeAPI.git && cd RealtimeAPI/server
cp .env.example .env   # edit .env — set LLM_BASE_URL, LLM_MODEL at minimum
cargo build --release
realtime onboard
```

Requires Rust nightly and cmake.
See server/.env.example for all configuration options.
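A minimal `.env` for a from-source install might look like the sketch below. The values are illustrative, not defaults; every variable name comes from the configuration table later in this README.

```shell
# Required — any OpenAI-compatible endpoint works (OpenAI / DeepSeek / Qwen / Ollama)
LLM_BASE_URL=https://api.openai.com/v1
LLM_MODEL=gpt-4o-mini
LLM_API_KEY=sk-xxx        # optional for self-hosted LLMs

# Optional overrides (see server/.env.example for the full list)
TTS_ENGINE=edge
BIND_ADDR=0.0.0.0:8080
```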
Windows
The one-line installer requires bash (macOS / Linux). On Windows, install from source via WSL or natively:
Option A — WSL (recommended):
```bash
wsl --install   # if you don't have WSL yet
# then run the standard install command inside WSL
curl -fsSL https://raw.githubusercontent.com/SquadyAI/RealtimeAPI/main/server/install.sh | bash
```

Option B — Native Windows:
```powershell
git clone https://github.com/SquadyAI/RealtimeAPI.git
cd RealtimeAPI\server
copy .env.example .env
cargo build --release
.\target\release\realtime.exe
```

Edit `.env` before starting — set at minimum:

- `LLM_BASE_URL` — e.g. `https://api.groq.com/openai/v1` (free at console.groq.com)
- `LLM_API_KEY` — your API key
- `LLM_MODEL` — e.g. `llama-3.3-70b-versatile`
- `WHISPERLIVE_PATH` — your WhisperLive WebSocket URL
The interactive `realtime onboard` wizard is bash-only (macOS / Linux / WSL). On native Windows, configure `.env` manually.
Requires Rust, cmake, and Visual Studio Build Tools (C++ workload).
Docker
```bash
docker run -p 8080:8080 \
  -e LLM_BASE_URL=https://api.openai.com/v1 \
  -e LLM_API_KEY=sk-xxx \
  -e LLM_MODEL=gpt-4o-mini \
  ghcr.io/squadyai/realtime:latest
```

Try it online: https://port2.luxhub.top:2097/ — no setup needed.
Or self-host: start the server and open http://localhost:8080 — a fully functional voice conversation UI is built in.
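If you prefer Compose, the `docker run` command above translates to a small `docker-compose.yml`. This is a sketch: the service name and env values are illustrative, only the image and port come from this README.

```yaml
services:
  realtime:
    image: ghcr.io/squadyai/realtime:latest
    ports:
      - "8080:8080"
    environment:
      LLM_BASE_URL: https://api.openai.com/v1
      LLM_API_KEY: sk-xxx
      LLM_MODEL: gpt-4o-mini
```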
- Voice assistants — Smart speakers, in-car assistants, customer service bots
- Real-time translation — Simultaneous interpretation across 25+ languages
- Smart device control — Voice-controlled IoT via built-in function-calling agents
- AI tutoring — Interactive language learning with real-time speech feedback
- Accessibility tools — Voice interfaces for applications
```text
              WebSocket (Opus audio)
         ┌────────────────────────────────────────┐
         │                                        │
         ▼                                        │
┌───────────────────┐          ┌────────────────────────┴───────────────────────────┐
│                   │          │                  Realtime API Server               │
│      Client       │          │                                                    │
│                   │  Opus    │  ┌───────┐   ┌─────┐   ┌─────┐   ┌─────┐           │
│  Microphone ─────┼────────▶│  │ VAD   │──▶│ ASR │──▶│ LLM │──▶│ TTS │──┐        │
│                   │          │  │Silero │   │Whis-│   │Open-│   │Edge/│  │        │
│                   │  Opus    │  │+Smart │   │per- │   │ AI  │   │Mini-│  │        │
│  Speaker ◀───────┼────────◀│  │ Turn  │   │Live │   │comp.│   │Max/ │  │        │
│                   │          │  └───────┘   └─────┘   └──┬──┘   │Azure│  │        │
│                   │          │                           │      └─────┘  │        │
└───────────────────┘          │                           ▼          ▲    │        │
                               │                  ┌────────────┐      │    │        │
                               │                  │   Agents   │      │  Paced      │
                               │                  │ + MCP Tools│──┘  Sender         │
                               │                  └────────────┘   ◀──┘             │
                               │                                                    │
                               └────────────────────────────────────────────────────┘
```
Full architecture diagrams: docs/architecture.md
| Protocol ID | Mode | Pipeline |
|---|---|---|
| `100` | Full conversation (default) | ASR → LLM → TTS |
| `1` | ASR only | Audio → Text |
| `2` | LLM only | Text → Text |
| `3` | TTS only | Text → Audio |
| `4` | Translation | Real-time interpretation |
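Switching pipelines is just a matter of the `protocol_id` in the session-config message. The sketch below configures the TTS-only mode; it assumes the envelope and payload fields carry over unchanged from the full-conversation example shown later in this README, which is not documented here.

```javascript
// Session-config envelope for the TTS-only pipeline (protocol_id 3).
// Field names mirror the full-conversation (protocol_id 100) example;
// whether TTS-only honors the same payload shape is an assumption.
const ttsOnlyConfig = {
  protocol_id: 3,          // 3 = TTS only: Text → Audio
  command_id: 1,
  session_id: 'tts-demo',
  payload: {
    type: 'session_config',
    voice_setting: { voice_id: 'zh_female_wanwanxiaohe_moon_bigtts' }
  }
};

// ws.send(JSON.stringify(ttsOnlyConfig));
```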
| Category | Provider | Status | Notes |
|---|---|---|---|
| ASR | WhisperLive | Default | Streaming, multi-language |
| LLM | Any OpenAI-compatible | Default | GPT, DeepSeek, Qwen, Ollama, vLLM, etc. |
| TTS | Edge TTS | Default | Free, 100+ languages |
| TTS | MiniMax | Alternative | Chinese optimized, 50+ voices |
| TTS | Azure Speech | Alternative | High quality, multi-language |
| TTS | VolcEngine | Alternative | Chinese voices |
| TTS | Baidu TTS | Alternative | Chinese voices |
| VAD | Silero + SmartTurn | Default | Two-layer: acoustic (32ms) + semantic |
| Tools | MCP Protocol | Built-in | Dynamic tool extension |
| Tools | Function-calling Agents | Built-in | Search, translate, navigate, device control, etc. |
| Metric | Value | Notes |
|---|---|---|
| End-to-end latency | ≤ 450ms | Dominated by AI model inference (ASR/LLM/TTS) |
| VAD inference | 82 µs/frame | 2.6x realtime headroom on 32ms frames |
| Text splitting | 73 µs/turn | Streaming sentence segmentation for TTS |
| Frame pacing | <20 ns/frame | Jitter-free audio delivery |
| Binary protocol | 68 ns/frame | 10-30x faster than JSON for audio |
| Concurrent sessions | 100+ | Single node |
| Memory footprint | ~200MB | Base runtime |
The Rust pipeline (VAD → text splitting → protocol → pacing) adds <0.5ms total — less than 0.1% of end-to-end latency. The bottleneck is entirely in AI model inference, which is the correct design.
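A rough sanity check of that claim, assuming the per-frame numbers in the benchmark table compose additively (a simplification — VAD runs as audio streams in, so it overlaps with capture rather than adding serially):

```javascript
// Back-of-envelope check of pipeline overhead, using the table above.
// Per 32 ms audio frame: VAD + binary protocol + frame pacing.
const frameMs = 32;
const vadMs = 0.082;         // 82 µs
const protocolMs = 0.000068; // 68 ns
const pacingMs = 0.00002;    // <20 ns, rounded up
const perFrameOverheadMs = vadMs + protocolMs + pacingMs;
const frameShare = perFrameOverheadMs / frameMs; // ≈ 0.26% of the frame interval

// Per turn, add one text-splitting pass (73 µs):
const perTurnMs = perFrameOverheadMs + 0.073;    // ≈ 0.155 ms, well under 0.5 ms
```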
Benchmark details: docs/benchmarks.md. To reproduce:

```bash
cd server && cargo bench
```
Production-grade features:
- Timeout + fallback for external services
- Connection pooling for LLM/TTS
- Graceful shutdown with in-flight session draining
- Structured logging (tracing) + Prometheus metrics + Langfuse integration
- Hot-reload TTS voice parameters without restart
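The timeout-plus-fallback pattern for external services can be sketched as below. The helper names, the default 2-second budget, and the engine callbacks are illustrative stand-ins, not this project's actual API.

```javascript
// Illustrative timeout + fallback wrapper; not the project's real API.
async function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('timeout')), ms);
  });
  try {
    return await Promise.race([promise, timeout]);
  } finally {
    clearTimeout(timer); // don't leak the timer when the call wins the race
  }
}

// Try the primary engine; fall back (e.g. MiniMax → Edge TTS) on timeout or error.
async function synthesize(text, primary, fallback, timeoutMs = 2000) {
  try {
    return await withTimeout(primary(text), timeoutMs);
  } catch {
    return fallback(text);
  }
}
```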
```javascript
const ws = new WebSocket('ws://localhost:8080/ws');

// 1. Configure session
ws.send(JSON.stringify({
  protocol_id: 100,
  command_id: 1,
  session_id: 'my-session',
  payload: {
    type: 'session_config',
    mode: 'vad',
    system_prompt: 'You are a helpful assistant.',
    voice_setting: { voice_id: 'zh_female_wanwanxiaohe_moon_bigtts' }
  }
}));

// 2. Send audio (binary: 32-byte header + PCM16 data)
ws.send(audioBuffer);

// 3. Receive responses
ws.onmessage = (event) => {
  if (typeof event.data === 'string') {
    const msg = JSON.parse(event.data);
    // ASR transcription, LLM text deltas, function calls...
  } else {
    // TTS audio chunks — play directly
  }
};
```

Full protocol reference: Realtime_API_Guide.md
```bash
realtime onboard   # Interactive setup wizard
realtime onboard   # Start the server (logs to logs/realtime.log)
realtime doctor    # Diagnose configuration and connectivity issues
```

All configuration is via environment variables. The setup wizard (`realtime onboard`) handles this interactively.
| Variable | Required | Description | Default |
|---|---|---|---|
| `LLM_BASE_URL` | Yes | OpenAI-compatible API endpoint | — |
| `LLM_MODEL` | Yes | Model name | — |
| `LLM_API_KEY` | No | API key (optional for self-hosted LLMs) | — |
| `ENABLE_TTS` | No | Enable voice synthesis | `true` |
| `TTS_ENGINE` | No | TTS engine (`edge`, `minimax`, `azure`, `volc`, `baidu`) | `edge` |
| `BIND_ADDR` | No | Listen address | `0.0.0.0:8080` |
| `VAD_THRESHOLD` | No | VAD sensitivity (0.0–1.0) | `0.6` |
| `MAX_CONCURRENT_SESSIONS` | No | Max concurrent sessions | `100` |
See server/.env.example for all options.
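For example, to raise the session cap and move the listener off port 8080 before starting the server (the values are illustrative; the variable names are from the table above):

```shell
export BIND_ADDR=0.0.0.0:9090
export MAX_CONCURRENT_SESSIONS=200
export VAD_THRESHOLD=0.5   # lower threshold = more sensitive speech detection
```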
```text
Realtime/
├── server/                  # Rust core server
│   ├── src/
│   │   ├── main.rs          # Entry point
│   │   ├── rpc/             # WebSocket server, session management, pipeline factory
│   │   │   └── pipeline/    # ASR→LLM→TTS orchestration, translation, etc.
│   │   ├── asr/             # WhisperLive streaming ASR
│   │   ├── llm/             # OpenAI-compatible client, function calling, history
│   │   ├── tts/             # Edge, MiniMax, Azure, VolcEngine, Baidu
│   │   ├── vad/             # Silero VAD + SmartTurn semantic detection
│   │   ├── agents/          # Function-calling agents
│   │   ├── mcp/             # Model Context Protocol client
│   │   ├── audio/           # PCM preprocessing, resampling, Opus codec
│   │   └── storage/         # PostgreSQL + in-memory fallback
│   └── Cargo.toml
└── clients/
    └── typescript/          # Web Playground (React)
```
| Document | Description |
|---|---|
| API Guide | WebSocket protocol reference |
| Architecture | System design with Mermaid diagrams |
| Benchmarks | Performance data with test conditions |
| .env.example | Full configuration reference |
We welcome contributions! See CONTRIBUTING.md for guidelines.
```bash
# Development
cd server
cargo build        # Debug build
cargo test         # Run tests
realtime doctor    # Verify setup
```

| Feature | Description | Status |
|---|---|---|
| Long-term Memory | Cross-session memory, user preference persistence | Planned |
| Agent Collaboration | Multi-agent orchestration and task delegation | Planned |
| Multimodal | Vision input (camera / screenshots) + voice, for GPT-4o class models | Planned |
| Voice Cloning | Few-shot voice cloning — talk in your own voice | Planned |
| Speaker Identification | Distinguish who is speaking in multi-person scenarios | Planned |
| Voiceprint Authentication | Speaker verification via voice biometrics | Planned |
| Hosted ASR | Zero-setup ASR — no self-hosting needed for new users | Planned |
Have ideas or want to contribute? Open an issue or check CONTRIBUTING.md.