This service is a Node.js backend built on Inworld Runtime that powers:
- Real‑time Speech‑to‑Text (STT) over WebSocket
- Image+Text → LLM → TTS streaming ("ImageChat") over WebSocket
- Optional HTTP test endpoints for quick local validation
Unity connects via HTTP to create a session token, then upgrades to WebSocket for interactive audio/text/image exchange.
- Node.js 20+
- An Inworld AI account and API key
```bash
git clone https://github.com/inworld-ai/multimodal-companion-node
cd multimodal-companion-node
```

Copy the env template and edit the values:

```bash
cp .env-sample .env
# Edit .env and set INWORLD_API_KEY (base64("apiKey:apiSecret")) and VAD_MODEL_PATH.
# Optionally set ALLOW_TEST_CLIENT for local HTML testing.
```

Get your API key from the Inworld Portal.
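As a sanity check, the `INWORLD_API_KEY` value is just the base64 encoding of `apiKey:apiSecret`. A minimal sketch (the helper name and sample credentials are illustrative, not part of the codebase):

```typescript
// Illustrative helper: encode an apiKey/apiSecret pair into the
// base64("apiKey:apiSecret") form expected in INWORLD_API_KEY.
function toInworldApiKey(apiKey: string, apiSecret: string): string {
  return Buffer.from(`${apiKey}:${apiSecret}`).toString('base64');
}

console.log(toInworldApiKey('my-key', 'my-secret'));
```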
```bash
npm install
npm run build
npm start
```

Server output (expected):
- "VAD client initialized"
- "STT Graph initialized"
- "Server running on http://localhost:<PORT>"
- "WebSocket available at ws://localhost:<PORT>/ws?key=<session_key>"
```
multimodal-companion-node/
├── src/
│   ├── index.ts            # Express HTTP server, WebSocket upgrade, session/token issuance, auth checks
│   ├── message_handler.ts  # Parses client messages (TEXT, AUDIO, AUDIO_SESSION_END, IMAGE_CHAT)
│   ├── stt_graph.ts        # Builds a single, long-lived STT GraphExecutor used across the process
│   ├── auth.ts             # HMAC auth verification for HTTP/WS (compatible with Unity InworldAuth)
│   ├── constants.ts        # Defaults for audio sample rates, VAD thresholds, text generation config
│   ├── event_factory.ts
│   ├── helpers.ts
│   └── types.ts
├── examples/
│   ├── test-audio.html       # Local test page for audio streaming
│   └── test-image-chat.html  # Local test page for image chat
├── assets/
│   └── models/
│       └── silero_vad.onnx   # VAD model file used for voice activity detection
├── package.json
├── tsconfig.json
└── LICENSE
```
- Required
  - `INWORLD_API_KEY`: base64("apiKey:apiSecret")
  - `VAD_MODEL_PATH`: local path to the VAD model (e.g., `silero_vad.onnx`)
- Optional
  - `PORT`: server port (default 3000)
  - `HTTP_CHAT_MAX_CONCURRENCY`: throttles HTTP `/chat` concurrency (if enabled)
  - `ALLOW_TEST_CLIENT`: when set to `true`, enables `GET /get_access_token` to issue short-lived `{ sessionKey, wsToken }` pairs for local HTML tests. Do NOT enable in production.
Example `.env` (you can start from `.env-sample`):

```bash
INWORLD_API_KEY=xxxxxx_base64_apiKey_colon_apiSecret
VAD_MODEL_PATH=assets/models/silero_vad.onnx
PORT=3000
# Enable the local HTML test helper endpoint (development only)
# ALLOW_TEST_CLIENT=true
```

For local development and testing without Unity, you can use the HTML test pages:
- Enable the test client endpoint: set `ALLOW_TEST_CLIENT=true` in your `.env` file and restart the server.
- Access the test pages:
  - `http://localhost:<PORT>/test-audio`: stream microphone audio
  - `http://localhost:<PORT>/test-image`: submit prompts with images
- How it works: the pages call `GET /get_access_token` to obtain `{ sessionKey, wsToken }`, then connect to `ws://host/ws?key=...&wsToken=...`.
Security note: `ALLOW_TEST_CLIENT` is for local development only. Do NOT enable it in production. Tokens are short-lived (5 minutes) and single-use.
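The connect step the test pages perform can be sketched as follows; `buildWsUrl` is a hypothetical helper, and only the URL shape comes from the source:

```typescript
// Hypothetical helper mirroring what the test pages do: take the
// { sessionKey, wsToken } pair returned by GET /get_access_token and build
// the WebSocket URL to connect to.
function buildWsUrl(host: string, sessionKey: string, wsToken: string): string {
  const url = new URL(`ws://${host}/ws`);
  url.searchParams.set('key', sessionKey);
  url.searchParams.set('wsToken', wsToken);
  return url.toString();
}

console.log(buildWsUrl('localhost:3000', 'abc123', 'tok456'));
// → ws://localhost:3000/ws?key=abc123&wsToken=tok456
```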
- HTTP endpoints require the HMAC header `Authorization: IW1-HMAC-SHA256 ...` (see `auth.ts`), generated by Unity's `InworldAuth`.
- WebSocket:
  - Preferred: obtain a short-lived `wsToken` from `POST /create-session` and connect to `/ws?key=<sessionKey>&wsToken=<token>`
  - Fallback: send the full `Authorization` header on the upgrade request (not recommended for clients)
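The exact header layout is defined in `auth.ts` and Unity's `InworldAuth`; the sketch below only shows the HMAC-SHA256 primitive involved and is not the real IW1 signing scheme:

```typescript
import { createHmac } from 'node:crypto';

// Generic HMAC-SHA256 signing primitive (illustrative only; the actual
// IW1-HMAC-SHA256 header format is implemented in src/auth.ts).
function hmacSha256Hex(secret: string, payload: string): string {
  return createHmac('sha256', secret).update(payload).digest('hex');
}
```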
- `POST /create-session` (protected)
  - Returns `{ sessionKey, wsToken }`
  - `wsToken` is single-use and short-lived
- `POST /chat` (protected, optional)
  - Accepts `prompt` and an optional `image` file (multipart)
  - Runs a one-off LLM graph (non-streaming) and returns `{ response }`
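A client-side sketch of the `POST /chat` body, assuming only the field names listed above (`prompt`, optional `image`); the filename is illustrative:

```typescript
// Build the multipart form for POST /chat. Field names come from the endpoint
// description above; 'frame.png' is an illustrative filename.
function buildChatForm(prompt: string, image?: Blob): FormData {
  const form = new FormData();
  form.set('prompt', prompt);
  if (image) form.set('image', image, 'frame.png');
  return form;
}
```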
- Client calls `POST /create-session` (with HMAC auth) → `{ sessionKey, wsToken }`
- Client connects to `ws://host/ws?key=<sessionKey>&wsToken=<token>`
- Client sends messages; the server returns text/audio and `INTERACTION_END` packets
Client → Server messages:
- `{ type: "text", text: string }`
- `{ type: "audio", audio: number[][] }`: streamed float32 chunks
- `{ type: "audioSessionEnd" }`: finalizes the STT turn
- `{ type: "imageChat", text: string, image: string, voiceId?: string }`: `image` is a data URL (base64)
Server → Client messages:
- `TEXT`: `{ text: { text, final }, routing: { source: { isAgent|isUser, name } } }`
- `AUDIO`: `{ audio: { chunk: base64_wav } }` (streamed for TTS)
- `INTERACTION_END`: marks the end of one turn/execution
- `ERROR`: `{ error }`
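The client→server shapes can be captured as a discriminated union. This typing is inferred from the message list above, not taken from `types.ts`:

```typescript
// Client → Server message shapes, inferred from the protocol list above.
type ClientMessage =
  | { type: 'text'; text: string }
  | { type: 'audio'; audio: number[][] }  // streamed float32 chunks
  | { type: 'audioSessionEnd' }
  | { type: 'imageChat'; text: string; image: string; voiceId?: string }; // image: data URL

const end: ClientMessage = { type: 'audioSessionEnd' };
console.log(JSON.stringify(end));
```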
- STT Graph
  - Constructed once at server startup (`stt_graph.ts`) and reused for all STT requests
  - For each STT turn: `executor.start(input, { executionId: v4() })` → read the first result → `closeExecution(executionResult.outputStream)`
- ImageChat Graph (LLM → TextChunking → TTS)
  - Per WebSocket connection: one shared executor reused across image+text turns
  - Rebuilt only if `voiceId` changes for that connection
  - For each ImageChat turn: `executor.start(LLMChatRequest, { executionId: v4() })` → stream TTS chunks → `closeExecution(executionResult.outputStream)`
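The per-turn lifecycle shared by both graphs can be sketched with stand-in types (the `Executor` interface below is a stub, not the real Inworld Runtime API):

```typescript
import { randomUUID } from 'node:crypto';

// Stand-in types: the real GraphExecutor API lives in the Inworld Runtime SDK.
interface OutputStream { close(): void }
interface ExecutionResult { outputStream: OutputStream }
interface Executor {
  start(input: unknown, opts: { executionId: string }): ExecutionResult;
  closeExecution(stream: OutputStream): void;
}

// The pattern described above: fresh executionId per turn, shared executor,
// and closeExecution(...) always runs, even if reading results throws.
function runTurn(executor: Executor, input: unknown, read: (r: ExecutionResult) => void): void {
  const result = executor.start(input, { executionId: randomUUID() });
  try {
    read(result); // consume streamed results
  } finally {
    executor.closeExecution(result.outputStream);
  }
}
```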
- LLM provider/model examples
  - OpenAI: `{ provider: 'openai', modelName: 'gpt-4o-mini', stream: true|false }`
  - Google Gemini: `{ provider: 'google', modelName: 'gemini-2.5-flash-lite', stream: true }`
- Text generation (see `constants.ts` → `TEXT_CONFIG`): `temperature`, `topP`, `maxNewTokens`, penalties, etc.
  - Typical behavior: higher `temperature`/`topP` → more diverse output; lower → more deterministic
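A sketch of what a `TEXT_CONFIG`-style object might look like; the field names follow the list above, but the numeric values (and `repetitionPenalty` as a stand-in for "penalties") are illustrative, not the actual defaults in `constants.ts`:

```typescript
// Illustrative generation settings (NOT the real defaults from src/constants.ts).
const textConfig = {
  temperature: 0.7,       // higher → more diverse output
  topP: 0.9,              // nucleus-sampling cutoff
  maxNewTokens: 256,      // cap on generated tokens per turn
  repetitionPenalty: 1.1, // hypothetical name for the "penalties" mentioned above
};
console.log(textConfig);
```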
- Input sample rate: 16 kHz (Unity mic)
- VAD: Silero VAD (ONNX), run locally, for voice activity detection
- An STT turn starts on voiced audio and finalizes when a pause exceeds the threshold (`PAUSE_DURATION_THRESHOLD_MS`)
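The finalize-on-pause rule can be sketched as below; the chunk duration and threshold values are illustrative (the real threshold lives in `constants.ts` as `PAUSE_DURATION_THRESHOLD_MS`):

```typescript
// Sketch of the turn-finalization rule: the turn ends once accumulated
// silence exceeds the pause threshold. Values here are illustrative.
const PAUSE_DURATION_THRESHOLD_MS = 800;
const CHUNK_MS = 32; // ~512 samples per chunk at 16 kHz

function shouldFinalizeTurn(voicedFlags: boolean[]): boolean {
  let silenceMs = 0;
  for (const voiced of voicedFlags) {
    silenceMs = voiced ? 0 : silenceMs + CHUNK_MS; // speech resets the silence clock
    if (silenceMs > PAUSE_DURATION_THRESHOLD_MS) return true;
  }
  return false;
}

console.log(shouldFinalizeTurn([true, false, false])); // short pause → false
```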
- The STT executor is global and reused (fast first token)
- The ImageChat executor is per WebSocket connection and reused; serialize turns per connection
- Optionally enforce small concurrency limits for HTTP `/chat` if enabled
- Always call `closeExecution(executionResult.outputStream)` after reading results
- Keep concurrency conservative (2–4) unless your plan allows more resources
- Prefer long-lived shared executors with per-turn `start(...)`/`closeExecution(...)`
- Expect GOAWAY after long idle periods; allow a light retry or lazy re-init on the next turn
- No image update: confirm Unity captures a fresh image before each `imageChat` send
- Long STT delay: verify the VAD thresholds and that `audioSessionEnd` is sent after speech
- Frequent GOAWAY on idle: acceptable; ensure executions are closed and executors are reused
- gRPC `Deadline Exceeded`: a single execution timed out; treat as recoverable (retry once)
- HTTP/2 `GOAWAY ENHANCE_YOUR_CALM (too_many_pings)`: the server is throttling idle keepalives; treat as recoverable and rebuild the channel/executor on next use
- WebSocket "closed without close handshake": usually a process restart/crash or a proxy idle-kill; implement client auto-reconnect with backoff
- "Your graph is not registered": an informational warning about remote variants; safe to ignore unless you explicitly use registry-managed graphs
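For the reconnect advice above, a capped exponential backoff is a common choice; this helper is a sketch, not part of the codebase, and the base/cap values are illustrative:

```typescript
// Capped exponential backoff for client auto-reconnect (illustrative values).
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 15_000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

console.log(backoffDelayMs(0));  // 500
console.log(backoffDelayMs(3));  // 4000
console.log(backoffDelayMs(10)); // capped at 15000
```

In practice you would also add jitter so many clients do not reconnect in lockstep after a server restart.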
Bug Reports: GitHub Issues
General Questions: For general inquiries and support, please email us at support@inworld.ai
We welcome contributions! Please see CONTRIBUTING.md for guidelines on how to contribute to this project.
This project is licensed under the MIT License - see the LICENSE file for details.