Note
This is a case study repository. Source code is maintained in a private repository and is not available for public access. This repo documents the project's architecture, features, and technical decisions for portfolio purposes.
Build voice-interactive virtual streamers with long-term memory, emotion expression, and real-time Live2D animation control.
Norgo is an AI VTuber system inspired by Neuro-sama. Talk to a virtual character through voice — it listens, thinks, responds with speech, and expresses emotions through Live2D in real time.
Built as a graduation project at National Yunlin University of Science and Technology (Information Management). Advised by Dr. Chao-Yi Huang.
- Full voice loop — Voice in, voice out. VAD-based silence detection, OpenAI Whisper transcription, Cartesia Sonic speech synthesis with 40ms time-to-first-byte.
- Long-term memory — Conversations are summarized and stored in a FAISS vector database. The system recalls relevant context from past sessions using a custom probability-weighted retrieval engine.
- Emotion-driven animation — LLM outputs emotion labels with every response. A continuous emotion state machine drives Live2D expression changes through Perlin noise and easing transitions.
- BreezyVoice support — Integrated MediaTek's open-source Taiwanese Mandarin TTS with prompt caching for voice cloning. This work spawned BreezyVoiceX.
- Multi-platform output — One core engine simultaneously drives Discord Bot (text + voice), VTube Studio (Live2D), and OBS Studio (stream overlay).
- Event-driven & modular — All modules communicate via publish-subscribe events. Adding a new platform requires zero changes to core logic.
- Configurable characters — Define personality, communication style, and behavior constraints in a single JSON file.
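The probability-weighted memory retrieval mentioned above could be sketched as follows. This is illustrative only, since the actual engine is in the private repo: the similarity-times-recency scoring, the softmax temperature, and the half-life decay are all assumed values, not Norgo's real parameters.

```python
import math
import random

# Illustrative sketch: vector-index similarities are combined with an
# exponential recency weight and normalized into a sampling distribution.
# temperature / half_life / k are made-up defaults, not the project's.
def weighted_recall(memories, query_sims, now, temperature=0.1,
                    half_life=7 * 86400, k=2, seed=0):
    """memories: list of (text, timestamp); query_sims: one similarity per memory."""
    scores = []
    for (text, ts), sim in zip(memories, query_sims):
        recency = 0.5 ** ((now - ts) / half_life)   # exponential time decay
        scores.append(sim * recency)
    # Softmax over combined scores -> retrieval probabilities.
    exps = [math.exp(s / temperature) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    rng = random.Random(seed)                       # seeded for reproducibility
    picks = rng.choices(range(len(memories)), weights=probs, k=k)
    return [memories[i][0] for i in picks]
```

Sampling (rather than always taking the top-k hits) lets older or weaker memories occasionally surface, which keeps recalled context varied across sessions.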
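The continuous emotion state machine could drive a Live2D parameter roughly like this sketch: the value eases toward the LLM-tagged target each tick, while smooth 1D value noise (a lightweight stand-in for the Perlin noise the project uses) adds idle micro-motion. Class names, easing factors, and amplitudes here are assumptions.

```python
import math
import random

def value_noise_1d(t: float, seed: int = 0) -> float:
    """Smoothly interpolated pseudo-random values in [-1, 1] at integer lattice points."""
    i0 = math.floor(t)
    r0 = random.Random(seed * 1000003 + i0).uniform(-1, 1)
    r1 = random.Random(seed * 1000003 + i0 + 1).uniform(-1, 1)
    f = t - i0
    s = f * f * (3 - 2 * f)          # smoothstep between lattice values
    return r0 + (r1 - r0) * s

# Hypothetical driver for one animation parameter (e.g. mouth smile, brow).
class EmotionParam:
    def __init__(self, value: float = 0.0, ease: float = 0.15, sway: float = 0.05):
        self.value, self.target = value, value
        self.ease, self.sway = ease, sway

    def set_target(self, target: float) -> None:
        self.target = target           # set when the LLM emits a new emotion label

    def tick(self, t: float) -> float:
        # Exponential easing toward target, plus low-amplitude noise sway
        # so the model never sits perfectly still between emotion changes.
        self.value += (self.target - self.value) * self.ease
        return self.value + self.sway * value_noise_1d(t)
```

Each frame, the returned value would be sent to VTube Studio as a parameter injection.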
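The publish-subscribe decoupling described above can be sketched with a minimal async event bus. The `EventBus` name and the `"user_utterance"` topic are illustrative, not the project's actual API; the point is that a new platform only registers a handler and never touches core logic.

```python
import asyncio
from collections import defaultdict
from typing import Any, Awaitable, Callable

class EventBus:
    """Minimal async pub/sub bus: modules subscribe to topics by name."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Callable[[Any], Awaitable[None]]]] = defaultdict(list)

    def subscribe(self, event: str, handler: Callable[[Any], Awaitable[None]]) -> None:
        self._handlers[event].append(handler)

    async def publish(self, event: str, payload: Any = None) -> None:
        # All handlers for this event run concurrently.
        await asyncio.gather(*(h(payload) for h in self._handlers[event]))

async def demo() -> list[str]:
    bus = EventBus()
    log: list[str] = []

    async def llm_handler(text: str) -> None:       # core pipeline consumer
        log.append(f"llm:{text}")

    async def overlay_handler(text: str) -> None:   # e.g. an OBS overlay module
        log.append(f"obs:{text}")

    bus.subscribe("user_utterance", llm_handler)
    bus.subscribe("user_utterance", overlay_handler)
    await bus.publish("user_utterance", "hello")
    return log
```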
The system follows an event-driven design where voice input, LLM inference, speech synthesis, and animation updates run concurrently through async event dispatch. A single voice interaction flows through:
STT (silence detection + transcription) → RAG (memory retrieval) → LLM (response + emotion tagging) → in parallel: TTS (speech) + VTS (animation) → output
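The flow above can be sketched with `asyncio`, using stub coroutines in place of the real STT/RAG/LLM/TTS/VTS modules (all function names and return shapes here are assumptions):

```python
import asyncio

async def transcribe(audio: bytes) -> str:                  # STT stub
    return "hello norgo"

async def recall(text: str) -> list[str]:                   # RAG stub
    return ["past context"]

async def respond(text: str, memories: list[str]) -> dict:  # LLM stub
    return {"reply": f"hi! (recalled {len(memories)})", "emotion": "joy"}

async def speak(reply: str) -> str:                         # TTS stub
    return f"audio:{reply}"

async def animate(emotion: str) -> str:                     # VTS stub
    return f"expression:{emotion}"

async def handle_turn(audio: bytes) -> list[str]:
    # Sequential stages: transcription -> memory retrieval -> response.
    text = await transcribe(audio)
    memories = await recall(text)
    out = await respond(text, memories)
    # Speech synthesis and animation run in parallel, as in the flow above.
    return await asyncio.gather(speak(out["reply"]), animate(out["emotion"]))
```

Running speech and animation concurrently matters for perceived latency: the avatar's expression can start changing while the first audio chunk is still being synthesized.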
| Category | Technologies |
|---|---|
| Language | Python 3.11+ |
| LLM | OpenAI API (GPT-4o-mini), structured JSON output |
| STT | OpenAI Whisper, WebRTC VAD |
| TTS | Cartesia Sonic / BreezyVoice (voice cloning) |
| Memory | FAISS, Jina Embeddings v3 |
| Animation | VTube Studio API, Live2D, Perlin noise |
| Platform | Discord.py, OBS WebSocket |
| Async | asyncio, ThreadPoolExecutor |
System Architect & Core Developer on a team of 3, over 14 months.
- Designed the overall system architecture (event-driven, modular, fully async)
- Built the LLM integration layer with JSON-mode structured output
- Built the voice interaction pipeline (VAD recording → STT → conversation processing)
- Integrated BreezyVoice with prompt caching for voice cloning
- Designed the event listener system for decoupled module communication
- BreezyVoiceX — Open-source TTS enhancement that originated from this project
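The prompt-caching idea behind the BreezyVoice integration can be sketched as content-hash memoization: voice cloning conditions every synthesis call on a reference speaker prompt (audio plus transcript), so the expensive prompt-processing step only needs to run once per speaker. The class and function names below are stand-ins, not BreezyVoice's real API.

```python
import hashlib

class PromptCache:
    """Memoizes processed speaker prompts by content hash of the reference."""

    def __init__(self, build_prompt):
        self._build = build_prompt          # expensive: encodes reference audio
        self._cache: dict[str, object] = {}
        self.misses = 0

    def get(self, ref_audio: bytes, ref_text: str):
        key = hashlib.sha256(ref_audio + ref_text.encode()).hexdigest()
        if key not in self._cache:
            self.misses += 1                # first request for this speaker prompt
            self._cache[key] = self._build(ref_audio, ref_text)
        return self._cache[key]             # later requests reuse the cached prompt
```

With a fixed character voice, every synthesis call after the first becomes a cache hit, which removes the per-request prompt-encoding cost from the latency path.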
Feel free to reach out via email for technical details.

