A Real-Time Voice-Driven Operating System Built with Streaming AI Pipelines
VoiceOS is a real-time voice-driven operating system interface that allows users to interact with their computer using natural language.
Unlike traditional assistants, VoiceOS is built as a modular AI system architecture combining:
- Streaming Speech Recognition
- LLM Reasoning Engine
- Event-Driven System Architecture
- Tool Execution Framework
- Web Research Engine
- OS Automation Layer
- Real-Time Conversational Feedback
The system is designed to run fully locally on CPU hardware with optimized streaming pipelines.
Continuous speech recognition using streaming transcription.
Interprets commands, plans actions, and decides which tools to execute.
VoiceOS can search the web, analyze sources, and summarize insights.
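The source-analysis step can be sketched with `beautifulsoup4` (already in the requirements). `extract_readable_text` is a hypothetical helper; in the full pipeline the HTML would come from a `requests` fetch rather than a literal string:

```python
from bs4 import BeautifulSoup

def extract_readable_text(html: str) -> str:
    """Strip tags and scripts from a fetched page, returning visible text."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()  # drop non-visible content entirely
    return " ".join(soup.get_text(separator=" ").split())

# In the real engine the HTML would come from requests.get(url).text
sample = "<html><body><h1>RL News</h1><script>x=1</script><p>New results today.</p></body></html>"
print(extract_readable_text(sample))  # → RL News New results today.
```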
Control your computer using natural voice commands.
Examples:
- Open applications
- Type text
- Switch windows
- Clipboard automation
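A minimal sketch of how the automation layer might translate "open application" into a platform command. `launch_command` is a hypothetical helper, and in VoiceOS the action would only run after user confirmation:

```python
import sys

def launch_command(app: str) -> list[str]:
    """Build a platform-specific launch command for an application.
    (Hypothetical helper; the real tool layer asks for confirmation first.)"""
    if sys.platform == "darwin":
        return ["open", "-a", app]
    if sys.platform.startswith("win"):
        return ["cmd", "/c", "start", "", app]
    return ["xdg-open", app]  # Linux fallback

# Passing the result to subprocess.Popen(...) would actually launch the app
print(launch_command("chrome"))
```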
Back-channel listening makes the assistant behave more naturally.
Example:
```
User: (speaking...)
VoiceOS: "mm-hmm"
VoiceOS: "I see"
```
The system follows an event-driven AI architecture.
```mermaid
graph TD
    A[Microphone Input] --> B[Streaming Speech Recognition]
    B --> C[LLM Reasoning Engine]
    C --> D[Decision Engine]
    D --> E[Tool Execution Layer]
    E --> F[OS Automation Tools]
    E --> G[Web Research Engine]
    E --> H[Memory System]
    C --> I[Streaming Text To Speech]
    I --> J[Audio Response]
```
This architecture provides:
- modular components
- asynchronous processing
- easy extensibility
Traditional voice assistants run sequentially:
Speech → STT → LLM → TTS
VoiceOS uses a parallel streaming pipeline.
```mermaid
sequenceDiagram
    participant User
    participant STT
    participant LLM
    participant TTS
    participant Speaker
    User->>STT: Speak
    STT-->>LLM: Partial Transcript
    LLM-->>TTS: Token Stream
    TTS-->>Speaker: Speech Output
```
This pipeline enables responses to begin before the full sentence is processed.
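The overlap can be sketched with `asyncio` queues connecting the stages; the LLM stage below is an uppercase echo stub standing in for real token generation:

```python
import asyncio

async def stt(words, out_q):
    """Emit partial transcripts word by word instead of waiting for the end."""
    for w in words:
        await out_q.put(w)
    await out_q.put(None)  # end-of-utterance sentinel

async def llm(in_q, out_q):
    """Start producing response tokens from partial input (echo stub here)."""
    while (w := await in_q.get()) is not None:
        await out_q.put(w.upper())  # stand-in for token generation
    await out_q.put(None)

async def tts(in_q, spoken):
    """Consume tokens as they stream in; playback starts before STT finishes."""
    while (t := await in_q.get()) is not None:
        spoken.append(t)

async def main():
    q1, q2, spoken = asyncio.Queue(), asyncio.Queue(), []
    await asyncio.gather(stt(["open", "chrome"], q1), llm(q1, q2), tts(q2, spoken))
    return spoken

print(asyncio.run(main()))  # → ['OPEN', 'CHROME']
```

All three stages run concurrently, so downstream work begins on the first word rather than the full utterance.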
Average latency:
~600ms on CPU
Example interactions:

- "Open Chrome"
- "Search for reinforcement learning robotics"
- "Type hello world"
- "Summarize latest AI research"
- "Switch window"
```
voice-os/
├── backend/
│   ├── core/
│   │   ├── event_bus.py
│   │   ├── events.py
│   │   └── stream_pipeline.py
│   ├── stt/
│   │   └── streaming_whisper.py
│   ├── llm/
│   │   └── streaming_llm.py
│   ├── tts/
│   │   └── streaming_tts.py
│   ├── listener/
│   │   └── backchannel_engine.py
│   ├── interrupt/
│   │   ├── speech_state.py
│   │   └── tts_controller.py
│   ├── research/
│   │   ├── web_search.py
│   │   ├── summarizer.py
│   │   └── analysis_engine.py
│   ├── tools/
│   │   └── os_control/
│   └── model_manager/
│       ├── hardware_detector.py
│       ├── model_registry.py
│       ├── model_downloader.py
│       └── model_manager.py
├── models/
├── README.md
└── requirements.txt
```
VoiceOS uses optimized local AI models.
| Component | Model |
|---|---|
| Speech Recognition | Whisper Tiny |
| Reasoning | Mistral 7B (GGUF Quantized) |
| Speech Generation | Coqui Tacotron |
The Model Manager automatically:
- detects hardware
- downloads models
- selects the optimal configuration
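The configuration step can be sketched as below. The thresholds and quantization names are illustrative, and a real manager would probe available RAM itself (e.g. via `psutil`, which is in the requirements) rather than take it as a parameter:

```python
import os

def select_config(ram_gb: float) -> dict:
    """Pick model settings for a given RAM budget.
    (Illustrative thresholds; the real manager probes hardware directly.)"""
    cores = os.cpu_count() or 1
    return {
        "whisper": "tiny" if ram_gb < 8 else "base",
        "llm_quant": "Q4_K_M" if ram_gb < 16 else "Q5_K_M",
        "threads": max(1, cores - 1),  # leave one core for the audio stack
    }

print(select_config(8.0))
```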
```bash
git clone https://github.com/yourusername/voice-os.git
cd voice-os
python -m venv venv
```

Activate the environment:

Mac/Linux:
```bash
source venv/bin/activate
```

Windows:
```bash
venv\Scripts\activate
```

Install dependencies:
```bash
pip install -r requirements.txt
```
Main libraries include:

- `faster-whisper`
- `llama-cpp-python`
- `coqui-tts`
- `beautifulsoup4`
- `requests`
- `pyautogui`
- `psutil`
```bash
python backend/main.py
```
First launch will:
- detect hardware
- download AI models
- initialize system components
- "Open Chrome"
- "Search for latest reinforcement learning papers"
- "Type hello world"
- "Switch window"
- "Summarize today's AI news"
Central communication layer connecting all modules.
Handles:
- message routing
- asynchronous events
- module communication
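A minimal synchronous sketch of such a bus (the real one routes asynchronous events); the topic names are illustrative:

```python
from collections import defaultdict
from typing import Callable

class EventBus:
    """Minimal pub/sub: modules subscribe to topics and publish payloads
    without knowing about each other."""
    def __init__(self):
        self._subs: dict[str, list[Callable]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable) -> None:
        self._subs[topic].append(handler)

    def publish(self, topic: str, payload) -> None:
        for handler in self._subs[topic]:
            handler(payload)

bus = EventBus()
received = []
bus.subscribe("stt.partial", received.append)  # e.g. the LLM stage listening
bus.publish("stt.partial", "open chr")
print(received)  # → ['open chr']
```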
Interprets commands and generates structured actions.
Example output:
```json
{
  "intent": "open_application",
  "tool": "os_open_app",
  "parameters": {
    "app": "chrome"
  }
}
```
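Dispatching such an action can be sketched as a lookup into a tool registry; the stub handler here is illustrative:

```python
import json

def dispatch(raw: str, registry: dict) -> str:
    """Parse a structured LLM action and route it to the matching tool."""
    action = json.loads(raw)
    return registry[action["tool"]](**action["parameters"])

# Stub registry: a real handler would call into the OS automation layer
registry = {"os_open_app": lambda app: f"launching {app}"}
raw = '{"intent": "open_application", "tool": "os_open_app", "parameters": {"app": "chrome"}}'
print(dispatch(raw, registry))  # → launching chrome
```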
Executes safe system operations.
Supported tools:
- OS automation
- Web research
- Clipboard management
- Keyboard control
Every OS action requires confirmation.
This prevents:
- unsafe automation
- destructive commands
- malicious behavior
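The confirmation gate can be sketched as a wrapper that asks before any tool runs; in VoiceOS the prompt would be spoken and the answer transcribed, so the `ask` callback here is a stand-in:

```python
def confirmed(tool: str, ask) -> bool:
    """Gate an OS action behind explicit user confirmation.
    `ask` takes a prompt and returns the user's reply (spoken in VoiceOS)."""
    prompt = f"Run {tool}? (yes/no)"
    return ask(prompt).strip().lower() in {"yes", "y"}

# Stub callbacks in place of the speech loop
print(confirmed("os_open_app", lambda p: "yes"))  # → True
print(confirmed("os_shutdown", lambda p: "no"))   # → False
```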
Typical CPU laptop performance:
| Stage | Latency |
|---|---|
| Speech detection | 50ms |
| STT partial result | 200ms |
| LLM first token | 300ms |
| TTS playback | 200ms |
Total perceived latency:
~600ms
Planned improvements:
- Multi-agent AI architecture
- Long-term memory graph
- Desktop GUI interface
- Autonomous task planning
- Distributed inference
Contributions are welcome.
Areas of interest:
- AI agents
- speech processing
- system design
- low-latency inference
MIT License
If you find this project interesting:
- ⭐ Star the repository
- Fork the project
- Build your own VoiceOS extensions
Built with ❤️ for exploring the future of voice-driven computing
