@coding-crying
## Summary

Adds an OpenAI-compatible TTS server with both HTTP and WebSocket endpoints for real-time streaming TTS applications.

## Endpoints

- `POST /v1/audio/speech` - HTTP streaming (OpenAI API compatible; see the client sketch below)
- `WS /v1/audio/speech/stream` - Bidirectional WebSocket streaming
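
As a quick illustration, here is a minimal streaming client for the HTTP endpoint. The JSON fields mirror the OpenAI `audio/speech` request schema; exactly which fields this server honors (e.g. `model`) is an assumption based on the OpenAI-compatibility claim, not verified against the PR code.

```python
# Minimal streaming client for POST /v1/audio/speech.
# Field names follow the OpenAI audio/speech schema; "model" may be
# ignored by this server (assumption).
import requests

resp = requests.post(
    "http://localhost:50000/v1/audio/speech",
    json={
        "model": "cosyvoice",             # assumed/optional field
        "input": "Hello from CosyVoice!",
        "voice": "speaker.wav",           # reference speaker, as in the WS config
        "speed": 1.0,
    },
    stream=True,
)
resp.raise_for_status()

# Write audio bytes to disk as they arrive instead of buffering the
# whole response, which is the point of the streaming endpoint.
with open("out.audio", "wb") as f:
    for chunk in resp.iter_content(chunk_size=4096):
        f.write(chunk)
```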

## WebSocket Protocol

1. Connect to `ws://host:port/v1/audio/speech/stream`
2. Send config: `{"voice": "speaker.wav", "speed": 1.0}`
3. Send text chunks: `{"text": "Hello"}`
4. Send end signal: `{"event": "end"}`
5. Receive binary PCM audio (int16, 24kHz, mono)
6. Server closes connection when complete

A client sketch implementing these steps follows.
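
For concreteness, here is a client implementing steps 1-6 with the third-party `websockets` package (the package choice is mine; any WebSocket client works). Buffering until the server closes keeps the example short; a real-time agent would feed each binary frame straight to audio output instead.

```python
# WebSocket client for /v1/audio/speech/stream, following the
# config -> text -> end -> binary PCM protocol described above.
import asyncio
import json

import websockets  # third-party: pip install websockets

async def synthesize(text: str) -> bytes:
    audio = bytearray()
    uri = "ws://localhost:50000/v1/audio/speech/stream"
    async with websockets.connect(uri) as ws:              # step 1
        await ws.send(json.dumps({"voice": "speaker.wav",
                                  "speed": 1.0}))          # step 2
        await ws.send(json.dumps({"text": text}))          # step 3
        await ws.send(json.dumps({"event": "end"}))        # step 4
        try:
            while True:
                msg = await ws.recv()                      # step 5
                if isinstance(msg, bytes):
                    audio.extend(msg)
        except websockets.exceptions.ConnectionClosed:     # step 6
            pass
    return bytes(audio)

pcm = asyncio.run(synthesize("Hello"))  # raw int16 mono PCM at 24 kHz
```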

## Files Added/Modified

- `openai_server.py`: FastAPI server with HTTP + WebSocket TTS endpoints (a rough skeleton of this wiring is sketched below)
- `run_openai_server.sh`: Launch script with venv activation
- `cosyvoice/llm/llm.py`: Add `inference_bistream` to the CosyVoice3LM class for streaming support
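
To show how the two endpoints can coexist in one FastAPI app, here is a hypothetical skeleton. It is not the actual `openai_server.py`: the `tts_stream` generator stands in for the real CosyVoice synthesis call, and the real server presumably synthesizes incrementally as text chunks arrive rather than buffering the full text as done here.

```python
# Hypothetical shape of a combined HTTP + WebSocket TTS server.
from fastapi import FastAPI, WebSocket
from fastapi.responses import StreamingResponse

app = FastAPI()

def tts_stream(text: str, voice: str, speed: float):
    """Placeholder generator yielding int16 PCM chunks at 24 kHz."""
    yield b""

@app.post("/v1/audio/speech")
async def speech(body: dict):
    # OpenAI-style JSON body in, streamed audio bytes out.
    return StreamingResponse(
        tts_stream(body["input"], body.get("voice", ""), body.get("speed", 1.0)),
        media_type="application/octet-stream",
    )

@app.websocket("/v1/audio/speech/stream")
async def speech_ws(ws: WebSocket):
    await ws.accept()
    config = await ws.receive_json()            # step 2: config frame
    text = ""
    while True:
        msg = await ws.receive_json()
        if msg.get("event") == "end":           # step 4: end signal
            break
        text += msg.get("text", "")             # step 3: text chunks
    for chunk in tts_stream(text, config.get("voice", ""),
                            config.get("speed", 1.0)):
        await ws.send_bytes(chunk)              # step 5: binary PCM
    await ws.close()                            # step 6
```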

## Usage

```bash
bash run_openai_server.sh
# Server runs on http://0.0.0.0:50000
```
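
The WebSocket endpoint returns raw int16 PCM at 24 kHz with no container, so it won't open directly in most audio tools. A stdlib helper can wrap the bytes in a WAV header (names here are illustrative):

```python
# Wrap raw server output (int16, 24 kHz, mono PCM) in a WAV container.
import wave

def pcm_to_wav(pcm: bytes, path: str = "out.wav") -> None:
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)       # mono
        wav.setsampwidth(2)       # int16 = 2 bytes per sample
        wav.setframerate(24000)   # 24 kHz
        wav.writeframes(pcm)
```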

## Use Case

This enables CosyVoice to be used as a drop-in TTS backend for:

- LiveKit voice agents
- Real-time conversational AI
- Any application needing streaming TTS with WebSocket support

Tested with the CosyVoice3-0.5B model for real-time voice agent applications.

@LongQIByte

Thanks for the PR! I’m not a maintainer, but I’ll pull the branch and try it out locally. I’ll share feedback if I find anything.

@coding-crying (Author)

> Thanks for the PR! I’m not a maintainer, but I’ll pull the branch and try it out locally. I’ll share feedback if I find anything.

After more testing, I was seeing some odd splits between audio chunks on my 3090, but I think that may be because the 3090's real-time factor (RTF) was above 1, i.e. synthesis was running slower than real time. If you have a beefier GPU, let me know whether the issue persists. :) -will
