A production-ready AI voice assistant built with FastAPI, Twilio Media Streams, and OpenAI’s Realtime API.
- Real-time speech recognition: Transcribes caller audio via OpenAI Realtime (`input.text.delta`).
- OpenAI TTS (sage): Streams AI-generated voice audio directly in G.711 μ-law format (`response.audio.delta`).
- Health check endpoint: `/health` returns `200 OK` for uptime monitoring.
- Dashboard: `/messages` displays call logs in a styled HTML UI.
- Persistent logging: All conversations are logged to `messages.json`.
- Retry logic: Backoff-based reconnect for OpenAI WebSockets.
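The backoff-based reconnect could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the names `backoff_delays` and `connect_with_retry`, the base delay, the cap, the attempt count, and the choice of `OSError` as the retryable error are all assumptions.

```python
import asyncio
import random


def backoff_delays(base=0.5, cap=8.0, attempts=5):
    """Yield exponentially growing delays (base, 2*base, 4*base, ...) capped at `cap`."""
    for attempt in range(attempts):
        yield min(cap, base * 2 ** attempt)


async def connect_with_retry(connect, retry_on=(OSError,), **backoff):
    """Await `connect()` until it succeeds, sleeping with jitter after each failure.

    `connect` is any zero-argument coroutine function (hypothetical here), for
    example a wrapper that opens the OpenAI WebSocket.
    """
    last_error = None
    for delay in backoff_delays(**backoff):
        try:
            return await connect()
        except retry_on as exc:
            last_error = exc
            # Add up to 10% jitter so many clients do not reconnect in lockstep.
            await asyncio.sleep(delay + random.uniform(0, delay / 10))
    raise ConnectionError("gave up after repeated connection failures") from last_error
```

With the default arguments, the delays are 0.5, 1, 2, 4, and 8 seconds before the helper gives up and raises `ConnectionError`.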
- Python 3.12+
- ffmpeg (optional for custom audio handling)
- A Twilio account with a phone number and Media Streams enabled
- OpenAI API key (with Realtime access)
- Clone your private repo:

      git clone git@github.com:YOUR_USER/YOUR_REPO.git
      cd YOUR_REPO

- Create a virtual environment and install dependencies:

      python -m venv venv
      source venv/bin/activate
      pip install -r requirements.txt

Create a .env file in the project root with:
# OpenAI
OPENAI_API_KEY=sk-...
SYSTEM_MESSAGE_PATH=./prompt.txt
VOICE=sage # change to another voice if desired
# Twilio
TWILIO_ACCOUNT_SID=AC...
TWILIO_AUTH_TOKEN=...
# Server
PORT=5050

Note: Do not commit `.env` to Git; it’s included in `.gitignore`.
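On the Python side, these variables might be read along the following lines. This is a sketch that assumes the variables are already present in the process environment (exported by the shell or loaded by a dotenv helper); the `Settings` class and `load_settings` function are hypothetical names, and the defaults simply mirror the example values above.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    system_message_path: str
    voice: str
    twilio_account_sid: str
    twilio_auth_token: str
    port: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, failing fast if the API key is missing."""
    api_key = env.get("OPENAI_API_KEY", "")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return Settings(
        openai_api_key=api_key,
        system_message_path=env.get("SYSTEM_MESSAGE_PATH", "./prompt.txt"),
        voice=env.get("VOICE", "sage"),
        twilio_account_sid=env.get("TWILIO_ACCOUNT_SID", ""),
        twilio_auth_token=env.get("TWILIO_AUTH_TOKEN", ""),
        port=int(env.get("PORT", "5050")),  # env values are strings, so convert explicitly
    )
```

Failing at startup when `OPENAI_API_KEY` is absent tends to be easier to diagnose than a rejected WebSocket handshake during the first call.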
- Start the server: `uvicorn main:app --reload --host 0.0.0.0 --port $PORT`
- Start ngrok (if local): `ngrok http $PORT`
- In Twilio Console, configure your phone number’s Webhook for Voice → Incoming Call to: `https://<your-ngrok-or-domain>/incoming-call`
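When Twilio calls the `/incoming-call` webhook, the server is expected to answer with TwiML that connects the call to the media stream. A minimal sketch of building that response is given here; `build_twiml` is a hypothetical helper, and the `wss://<host>/media-stream` URL assumes the WebSocket is served from the same host as the webhook.

```python
from xml.sax.saxutils import quoteattr


def build_twiml(host: str) -> str:
    """Return TwiML that connects the call to a WebSocket media stream on `host`."""
    stream_url = f"wss://{host}/media-stream"
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<Response><Connect>"
        # quoteattr adds the surrounding quotes and escapes special characters.
        f"<Stream url={quoteattr(stream_url)} />"
        "</Connect></Response>"
    )
```

The webhook handler would return this string with an XML content type, for instance using the `Host` header of the incoming request as `host`.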
- Add `railway.nix` for Python + ffmpeg (if needed).
- Ensure Railway Environment Variables match local `.env`.
- Deploy; Railway auto-detects `uvicorn main:app`.
- Visit `GET /messages` to view caller number, timestamp, transcripts, and AI replies in a styled table.
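Rendering the dashboard table from logged entries might look something like this sketch. The field names in `COLUMNS` and the `render_messages_table` function are assumptions for illustration; the actual schema of `messages.json` is not specified here.

```python
from html import escape

# Hypothetical field names for one logged conversation entry.
COLUMNS = ("caller", "timestamp", "transcript", "reply")


def render_messages_table(entries):
    """Render a list of dict entries as an HTML table, escaping all values."""
    header = "".join(f"<th>{escape(col)}</th>" for col in COLUMNS)
    rows = "".join(
        "<tr>"
        + "".join(f"<td>{escape(str(entry.get(col, '')))}</td>" for col in COLUMNS)
        + "</tr>"
        for entry in entries
    )
    return f"<table><thead><tr>{header}</tr></thead><tbody>{rows}</tbody></table>"
```

Escaping matters here because transcripts contain whatever the caller said, which should never be interpreted as markup.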
- Incoming call at `/incoming-call`: Twilio streams media to `/media-stream`.
- WebSocket (`handle_media_stream`):
  - Sends session update enabling audio and text modalities, with server-side VAD and realtime transcription.
  - Streams caller audio to OpenAI.
- Transcription: `input.text.delta` events are logged and shown on the dashboard.
- AI Reply: OpenAI’s `response.audio.delta` streams back voice audio (sage) to Twilio in G.711 μ-law format.
- Error Handling: Backoff for reconnects, graceful closure on socket `ConnectionClosedOK`.
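The message flow above can be sketched as three small pure functions: one that builds the session update, and two that translate between Twilio Media Streams messages and Realtime events. The function names are hypothetical, and the `input_audio_transcription` value is copied from this README rather than verified against the current API reference.

```python
import json
from typing import Optional


def build_session_update(voice: str, instructions: str) -> dict:
    """Session update enabling text + audio, server-side VAD, and G.711 mu-law audio."""
    return {
        "type": "session.update",
        "session": {
            "modalities": ["text", "audio"],
            "turn_detection": {"type": "server_vad"},
            "input_audio_format": "g711_ulaw",
            "output_audio_format": "g711_ulaw",
            # Value as stated in this README; treat it as an assumption.
            "input_audio_transcription": {"type": "realtime"},
            "voice": voice,
            "instructions": instructions,
        },
    }


def twilio_to_openai(raw: str) -> Optional[dict]:
    """Turn a Twilio `media` message into an audio-append event; ignore other events."""
    msg = json.loads(raw)
    if msg.get("event") != "media":
        return None
    # Twilio's payload is already base64-encoded mu-law, so it passes through unchanged.
    return {"type": "input_audio_buffer.append", "audio": msg["media"]["payload"]}


def openai_to_twilio(event: dict, stream_sid: str) -> Optional[dict]:
    """Turn a `response.audio.delta` event into a Twilio `media` message."""
    if event.get("type") != "response.audio.delta":
        return None
    return {"event": "media", "streamSid": stream_sid, "media": {"payload": event["delta"]}}
```

Because both sides use base64-encoded G.711 μ-law, no transcoding is needed in this sketch, which is presumably why ffmpeg is listed as optional.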
- No transcription? Ensure `modalities` includes `"text"` and `"input_audio_transcription": {"type": "realtime"}` in `send_session_update()`.
- No AI audio? Check Twilio Media Streams config, verify `voice` and `output_audio_format` in session.
- 500 on `/messages`? Delete or reinitialize `messages.json` to a valid JSON array (`[]`).
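One way to avoid the 500 on `/messages` altogether is to make the log reader tolerant of a missing or corrupt file. The following is a sketch under that assumption; `load_messages` is a hypothetical helper, not necessarily how the project reads its log.

```python
import json
from pathlib import Path


def load_messages(path: Path) -> list:
    """Return logged messages, resetting the file to [] if it is missing or invalid."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
        if isinstance(data, list):
            return data
    except (FileNotFoundError, json.JSONDecodeError):
        pass
    # Missing file, malformed JSON, or a non-array top level: start over with [].
    path.write_text("[]", encoding="utf-8")
    return []
```

Note that this discards the corrupt contents; copying the bad file aside before resetting would be a safer variant if the logs matter.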
MIT © Syed Fahim