
Norgo: Voice-Driven AI VTuber System


Note

This is a case study repository. Source code is maintained in a private repository and is not available for public access. This repo documents the project's architecture, features, and technical decisions for portfolio purposes.

Build voice-interactive virtual streamers with long-term memory, emotion expression, and real-time Live2D animation control.

Demo

What is Norgo?

Norgo is an AI VTuber system inspired by Neuro-sama. Talk to a virtual character through voice — it listens, thinks, responds with speech, and expresses emotions through Live2D in real time.

Built as a graduation project at National Yunlin University of Science and Technology (Information Management). Advised by Dr. Chao-Yi Huang.

Features

  • Full voice loop — Voice in, voice out. VAD-based silence detection, OpenAI Whisper transcription, Cartesia Sonic speech synthesis with 40ms time-to-first-byte.
  • Long-term memory — Conversations are summarized and stored in a FAISS vector database. The system recalls relevant context from past sessions using a custom probability-weighted retrieval engine.
  • Emotion-driven animation — LLM outputs emotion labels with every response. A continuous emotion state machine drives Live2D expression changes through Perlin noise and easing transitions.
  • BreezyVoice support — Integrated MediaTek's open-source Taiwanese Mandarin TTS with prompt caching for voice cloning. This work spawned BreezyVoiceX.
  • Multi-platform output — One core engine simultaneously drives Discord Bot (text + voice), VTube Studio (Live2D), and OBS Studio (stream overlay).
  • Event-driven & modular — All modules communicate via publish-subscribe events. Adding a new platform requires zero changes to core logic.
  • Configurable characters — Define personality, communication style, and behavior constraints in a single JSON file.
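The event-driven design described above can be sketched as a minimal async publish-subscribe bus. This is an illustrative sketch, not the project's actual API: the event name, handler signatures, and platform subscribers here are assumptions.

```python
import asyncio
from collections import defaultdict
from typing import Any, Awaitable, Callable


class EventBus:
    """Minimal async publish-subscribe bus (illustrative sketch)."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Callable[[Any], Awaitable[None]]]] = defaultdict(list)

    def subscribe(self, event: str, handler: Callable[[Any], Awaitable[None]]) -> None:
        self._handlers[event].append(handler)

    async def publish(self, event: str, payload: Any = None) -> None:
        # Run every handler registered for this event concurrently.
        await asyncio.gather(*(h(payload) for h in self._handlers[event]))


# Hypothetical usage: a new output platform only subscribes to events,
# so the core pipeline needs no changes.
async def main() -> list[str]:
    bus = EventBus()
    log: list[str] = []

    async def discord_out(text: str) -> None:
        log.append(f"discord:{text}")

    async def obs_overlay(text: str) -> None:
        log.append(f"obs:{text}")

    bus.subscribe("response_ready", discord_out)
    bus.subscribe("response_ready", obs_overlay)
    await bus.publish("response_ready", "hello")
    return log


print(asyncio.run(main()))
```

Because subscribers never call each other directly, adding a platform is one `subscribe` call against the bus.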

Architecture

(System architecture diagram)

The system follows an event-driven design where voice input, LLM inference, speech synthesis, and animation updates run concurrently through async event dispatch. A single voice interaction flows through:

STT (silence detection + transcription) → RAG (memory retrieval) → LLM (response + emotion tagging) → in parallel: TTS (speech) + VTS (animation) → output
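The flow above can be expressed as async stages with the final fan-out done via `asyncio.gather`. The stage stubs below are placeholders for the real components (Whisper, the RAG engine, GPT-4o-mini, Cartesia/BreezyVoice, VTube Studio); only the wiring is the point.

```python
import asyncio


# Illustrative stage stubs; each returns a tagged string in place of real output.
async def stt(audio: str) -> str:
    return f"text({audio})"

async def rag(text: str) -> str:
    return f"{text}+memory"

async def llm(prompt: str) -> dict:
    return {"reply": f"reply({prompt})", "emotion": "happy"}

async def tts(reply: str) -> str:
    return f"audio({reply})"

async def vts(emotion: str) -> str:
    return f"anim({emotion})"


async def interact(audio: str) -> tuple[str, str]:
    text = await stt(audio)          # silence detection + transcription
    prompt = await rag(text)         # memory retrieval
    out = await llm(prompt)          # response + emotion tagging
    # TTS and animation run in parallel, matching the diagram.
    speech, anim = await asyncio.gather(tts(out["reply"]), vts(out["emotion"]))
    return speech, anim


print(asyncio.run(interact("mic")))
```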

Tech Stack

Category  | Technologies
Language  | Python 3.11+
LLM       | OpenAI API (GPT-4o-mini), structured JSON output
STT       | OpenAI Whisper, WebRTC VAD
TTS       | Cartesia Sonic / BreezyVoice (voice cloning)
Memory    | FAISS, Jina Embeddings v3
Animation | VTube Studio API, Live2D, Perlin noise
Platform  | Discord.py, OBS WebSocket
Async     | asyncio, ThreadPoolExecutor
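The asyncio + ThreadPoolExecutor pairing typically means offloading blocking calls (such as a synchronous transcription request) so the event loop keeps dispatching events. A minimal sketch of that pattern, with a stand-in for the blocking call:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def transcribe_blocking(audio_bytes: bytes) -> str:
    # Stand-in for a blocking Whisper call; the real function is not shown here.
    return f"{len(audio_bytes)} bytes transcribed"


async def transcribe(audio_bytes: bytes, pool: ThreadPoolExecutor) -> str:
    loop = asyncio.get_running_loop()
    # Offload the blocking work to a worker thread so other coroutines
    # (animation updates, Discord events) are not stalled.
    return await loop.run_in_executor(pool, transcribe_blocking, audio_bytes)


async def main() -> str:
    with ThreadPoolExecutor(max_workers=2) as pool:
        return await transcribe(b"\x00" * 1024, pool)


print(asyncio.run(main()))
```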

My Role

System Architect & Core Developer in a team of 3, over 14 months.

  • Designed the overall system architecture (event-driven, modular, fully async)
  • Built the LLM integration layer with JSON-mode structured output
  • Built the voice interaction pipeline (VAD recording → STT → conversation processing)
  • Integrated BreezyVoice with prompt caching for voice cloning
  • Designed the event listener system for decoupled module communication
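The JSON-mode structured output mentioned above implies validating each model reply before it drives speech and animation. A hedged sketch of that parsing step; the schema (a `reply` plus an `emotion` field) and the emotion label set are assumptions based on the feature description, not the project's actual contract:

```python
import json
from dataclasses import dataclass

# Assumed label set for illustration; the real system may use different labels.
ALLOWED_EMOTIONS = {"neutral", "happy", "sad", "angry", "surprised"}


@dataclass
class LLMTurn:
    reply: str
    emotion: str


def parse_turn(raw: str) -> LLMTurn:
    """Validate a JSON-mode LLM response; fall back to neutral on unknown labels."""
    data = json.loads(raw)
    emotion = data.get("emotion", "neutral")
    if emotion not in ALLOWED_EMOTIONS:
        emotion = "neutral"
    return LLMTurn(reply=data["reply"], emotion=emotion)


turn = parse_turn('{"reply": "Hi there!", "emotion": "happy"}')
print(turn)
```

Falling back to a neutral label keeps a malformed emotion tag from ever reaching the animation layer.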

Related

  • BreezyVoiceX — Open-source TTS enhancement that originated from this project

Feel free to reach out via email for technical details.
