Note
This is a case study repository. Source code is maintained in a private repository and is not available for public access. This repo documents the project's architecture, features, and technical decisions for portfolio purposes.
Build voice-interactive virtual streamers with long-term memory, emotion expression, and real-time Live2D animation control.
Norgo is an AI VTuber system inspired by Neuro-sama. Talk to a virtual character through voice — it listens, thinks, responds with speech, and expresses emotions through Live2D in real time.
Built as a graduation project at National Yunlin University of Science and Technology (Information Management). Advised by Dr. Chao-Yi Huang.
- Full voice loop — Voice in, voice out. VAD-based silence detection, OpenAI Whisper transcription, Cartesia Sonic speech synthesis with 40ms time-to-first-byte.
- Long-term memory — Conversations are summarized and stored in a FAISS vector database. The system recalls relevant context from past sessions using a custom probability-weighted retrieval engine.
- Emotion-driven animation — LLM outputs emotion labels with every response. A continuous emotion state machine drives Live2D expression changes through Perlin noise and easing transitions.
- BreezyVoice support — Integrated MediaTek's open-source Taiwanese Mandarin TTS with prompt caching for voice cloning. This work spawned BreezyVoiceX.
- Multi-platform output — One core engine simultaneously drives Discord Bot (text + voice), VTube Studio (Live2D), and OBS Studio (stream overlay).
- Event-driven & modular — All modules communicate via publish-subscribe events. Adding a new platform requires zero changes to core logic.
- Configurable characters — Define personality, communication style, and behavior constraints in a single JSON file.
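The probability-weighted memory retrieval mentioned above could be sketched as follows. This is illustrative only, since the actual engine is in the private repo: the similarity-times-recency scoring, the softmax temperature, and the half-life decay are all assumed values, not Norgo's real parameters.

```python
import math
import random

# Illustrative sketch: vector-index similarities are combined with an
# exponential recency weight and normalized into a sampling distribution.
# temperature / half_life / k are made-up defaults, not the project's.
def weighted_recall(memories, query_sims, now, temperature=0.1,
                    half_life=7 * 86400, k=2, seed=0):
    """memories: list of (text, timestamp); query_sims: one similarity per memory."""
    scores = []
    for (text, ts), sim in zip(memories, query_sims):
        recency = 0.5 ** ((now - ts) / half_life)   # exponential time decay
        scores.append(sim * recency)
    # Softmax over combined scores -> retrieval probabilities.
    exps = [math.exp(s / temperature) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    rng = random.Random(seed)                       # seeded for reproducibility
    picks = rng.choices(range(len(memories)), weights=probs, k=k)
    return [memories[i][0] for i in picks]
```

Sampling (rather than always taking the top-k hits) lets older or weaker memories occasionally surface, which keeps recalled context varied across sessions.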
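The continuous emotion state machine could drive a Live2D parameter roughly like this sketch: the value eases toward the LLM-tagged target each tick, while smooth 1D value noise (a lightweight stand-in for the Perlin noise the project uses) adds idle micro-motion. Class names, easing factors, and amplitudes here are assumptions.

```python
import math
import random

def value_noise_1d(t: float, seed: int = 0) -> float:
    """Smoothly interpolated pseudo-random values in [-1, 1] at integer lattice points."""
    i0 = math.floor(t)
    r0 = random.Random(seed * 1000003 + i0).uniform(-1, 1)
    r1 = random.Random(seed * 1000003 + i0 + 1).uniform(-1, 1)
    f = t - i0
    s = f * f * (3 - 2 * f)          # smoothstep between lattice values
    return r0 + (r1 - r0) * s

# Hypothetical driver for one animation parameter (e.g. mouth smile, brow).
class EmotionParam:
    def __init__(self, value: float = 0.0, ease: float = 0.15, sway: float = 0.05):
        self.value, self.target = value, value
        self.ease, self.sway = ease, sway

    def set_target(self, target: float) -> None:
        self.target = target           # set when the LLM emits a new emotion label

    def tick(self, t: float) -> float:
        # Exponential easing toward target, plus low-amplitude noise sway
        # so the model never sits perfectly still between emotion changes.
        self.value += (self.target - self.value) * self.ease
        return self.value + self.sway * value_noise_1d(t)
```

Each frame, the returned value would be sent to VTube Studio as a parameter injection.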
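The publish-subscribe decoupling described above can be sketched with a minimal async event bus. The `EventBus` name and the `"user_utterance"` topic are illustrative, not the project's actual API; the point is that a new platform only registers a handler and never touches core logic.

```python
import asyncio
from collections import defaultdict
from typing import Any, Awaitable, Callable

class EventBus:
    """Minimal async pub/sub bus: modules subscribe to topics by name."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Callable[[Any], Awaitable[None]]]] = defaultdict(list)

    def subscribe(self, event: str, handler: Callable[[Any], Awaitable[None]]) -> None:
        self._handlers[event].append(handler)

    async def publish(self, event: str, payload: Any = None) -> None:
        # All handlers for this event run concurrently.
        await asyncio.gather(*(h(payload) for h in self._handlers[event]))

async def demo() -> list[str]:
    bus = EventBus()
    log: list[str] = []

    async def llm_handler(text: str) -> None:       # core pipeline consumer
        log.append(f"llm:{text}")

    async def overlay_handler(text: str) -> None:   # e.g. an OBS overlay module
        log.append(f"obs:{text}")

    bus.subscribe("user_utterance", llm_handler)
    bus.subscribe("user_utterance", overlay_handler)
    await bus.publish("user_utterance", "hello")
    return log
```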
The system follows an event-driven design where voice input, LLM inference, speech synthesis, and animation updates run concurrently through async event dispatch. A single voice interaction flows through:
STT (silence detection + transcription) → RAG (memory retrieval) → LLM (response + emotion tagging) → in parallel: TTS (speech) + VTS (animation) → output
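The flow above can be sketched with `asyncio`, using stub coroutines in place of the real STT/RAG/LLM/TTS/VTS modules (all function names and return shapes here are assumptions):

```python
import asyncio

async def transcribe(audio: bytes) -> str:                  # STT stub
    return "hello norgo"

async def recall(text: str) -> list[str]:                   # RAG stub
    return ["past context"]

async def respond(text: str, memories: list[str]) -> dict:  # LLM stub
    return {"reply": f"hi! (recalled {len(memories)})", "emotion": "joy"}

async def speak(reply: str) -> str:                         # TTS stub
    return f"audio:{reply}"

async def animate(emotion: str) -> str:                     # VTS stub
    return f"expression:{emotion}"

async def handle_turn(audio: bytes) -> list[str]:
    # Sequential stages: transcription -> memory retrieval -> response.
    text = await transcribe(audio)
    memories = await recall(text)
    out = await respond(text, memories)
    # Speech synthesis and animation run in parallel, as in the flow above.
    return await asyncio.gather(speak(out["reply"]), animate(out["emotion"]))
```

Running speech and animation concurrently matters for perceived latency: the avatar's expression can start changing while the first audio chunk is still being synthesized.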
| Category | Technologies |
|---|---|
| Language | Python 3.11+ |
| LLM | OpenAI API (GPT-4o-mini), structured JSON output |
| STT | OpenAI Whisper, WebRTC VAD |
| TTS | Cartesia Sonic / BreezyVoice (voice cloning) |
| Memory | FAISS, Jina Embeddings v3 |
| Animation | VTube Studio API, Live2D, Perlin noise |
| Platform | Discord.py, OBS WebSocket |
| Async | asyncio, ThreadPoolExecutor |
System Architect & Core Developer on a team of 3, over 14 months.
- Designed the overall system architecture (event-driven, modular, fully async)
- Built the LLM integration layer with JSON-mode structured output
- Built the voice interaction pipeline (VAD recording → STT → conversation processing)
- Integrated BreezyVoice with prompt caching for voice cloning
- Designed the event listener system for decoupled module communication
- BreezyVoiceX — Open-source TTS enhancement that originated from this project
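The prompt-caching idea behind the BreezyVoice integration can be sketched as content-hash memoization: voice cloning conditions every synthesis call on a reference speaker prompt (audio plus transcript), so the expensive prompt-processing step only needs to run once per speaker. The class and function names below are stand-ins, not BreezyVoice's real API.

```python
import hashlib

class PromptCache:
    """Memoizes processed speaker prompts by content hash of the reference."""

    def __init__(self, build_prompt):
        self._build = build_prompt          # expensive: encodes reference audio
        self._cache: dict[str, object] = {}
        self.misses = 0

    def get(self, ref_audio: bytes, ref_text: str):
        key = hashlib.sha256(ref_audio + ref_text.encode()).hexdigest()
        if key not in self._cache:
            self.misses += 1                # first request for this speaker prompt
            self._cache[key] = self._build(ref_audio, ref_text)
        return self._cache[key]             # later requests reuse the cached prompt
```

With a fixed character voice, every synthesis call after the first becomes a cache hit, which removes the per-request prompt-encoding cost from the latency path.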
Feel free to reach out via email for technical details.

