- Introduction
- Quick Start
- Usage
- Architecture Diagram
- Supported Operations
- Practical Guide
- Contributing
- License
voiceflow is an open-source project built with Go, designed to enable real-time voice interaction with Large Language Models (LLMs). By integrating various third-party voice platforms and local models, voiceflow supports real-time Speech-to-Text (STT), Text-to-Speech (TTS), and intelligent interaction with LLMs.
- Real-time Speech-to-Text (STT): Integrates with multiple cloud STT services (e.g., Azure, Google) and local models to convert user speech into text in real-time.
- LLM Interaction: Sends the recognized text directly to audio-capable LLMs to obtain intelligent responses.
- Text-to-Speech (TTS): Converts the LLM's text responses back into speech, supporting various TTS services (e.g., Azure, Google) and local models.
- Audio Storage & Access: Utilizes storage services like MinIO to store generated audio files and provide access URLs for real-time playback on the frontend.
- Pluggable Service Integration: Features a modular design allowing for pluggable integration of different STT, TTS services, and LLMs, facilitating easy extension and customization. 🎉
- Clone the Repository

  ```bash
  git clone https://github.com/telepace/voiceflow.git
  cd voiceflow
  ```

- Install Dependencies

  Ensure you have Go 1.16 or higher installed.

  ```bash
  go mod tidy
  ```
- Copy the Example Environment File

  ```bash
  cp configs/.env.example configs/.env
  ```

  Edit the `.env` file and fill in the appropriate configuration values:

  ```bash
  # Example Environment Variables
  MINIO_ENDPOINT=play.min.io      # Your MinIO server endpoint
  MINIO_ACCESS_KEY=youraccesskey  # Your MinIO access key
  MINIO_SECRET_KEY=yoursecretkey  # Your MinIO secret key
  AZURE_STT_KEY=yourazuresttkey   # Your Azure Speech-to-Text service key
  AZURE_TTS_KEY=yourazurettskey   # Your Azure Text-to-Speech service key
  # Add other necessary keys (e.g., Google Cloud, OpenAI API keys) as needed
  ```
- Configure `config.yaml`

  Edit `configs/config.yaml` according to your project requirements:

  ```yaml
  server:
    port: 8080                    # Port the server will listen on
    enable_tls: false             # Set to true to enable TLS/SSL
  minio:
    enabled: true                 # Set to true to enable MinIO storage
    bucket_name: voiceflow-audio  # Name of the MinIO bucket for audio files
  stt:                            # Speech-to-Text Configuration
    provider: azure               # Options: azure, google, local (choose your STT provider)
    # Add provider-specific settings here if needed
  tts:                            # Text-to-Speech Configuration
    provider: google              # Options: azure, google, local (choose your TTS provider)
    # Add provider-specific settings here if needed
  llm:                            # Large Language Model Configuration
    provider: openai              # Options: openai, local (choose your LLM provider)
    # Add provider-specific settings here (e.g., API key, model name)
  logging:
    level: info                   # Logging level (e.g., debug, info, warn, error)
  ```

  A minimal Go snippet that reads these values back is shown after this list.
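To sanity-check your setup, the values above can be read back from Go. The snippet below is only a sketch, not the project's actual `internal/config` module: it assumes the commonly used `github.com/joho/godotenv` and `gopkg.in/yaml.v3` packages, and the struct mirrors only the keys shown in the example `config.yaml`.

```go
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/joho/godotenv" // assumption: any .env loader would do here
	"gopkg.in/yaml.v3"
)

// Config mirrors only the keys shown in the example configs/config.yaml above.
type Config struct {
	Server struct {
		Port      int  `yaml:"port"`
		EnableTLS bool `yaml:"enable_tls"`
	} `yaml:"server"`
	STT struct {
		Provider string `yaml:"provider"`
	} `yaml:"stt"`
	TTS struct {
		Provider string `yaml:"provider"`
	} `yaml:"tts"`
}

func main() {
	// Load secrets from configs/.env into the process environment.
	if err := godotenv.Load("configs/.env"); err != nil {
		log.Fatalf("loading .env: %v", err)
	}

	// Load the non-secret business configuration from configs/config.yaml.
	raw, err := os.ReadFile("configs/config.yaml")
	if err != nil {
		log.Fatalf("reading config.yaml: %v", err)
	}
	var cfg Config
	if err := yaml.Unmarshal(raw, &cfg); err != nil {
		log.Fatalf("parsing config.yaml: %v", err)
	}

	fmt.Printf("STT=%s TTS=%s port=%d MinIO endpoint=%s\n",
		cfg.STT.Provider, cfg.TTS.Provider, cfg.Server.Port, os.Getenv("MINIO_ENDPOINT"))
}
```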
Run the following command in the project root directory:

```bash
go run cmd/main.go
```

Check if the service has started correctly by accessing http://localhost:8080 (or your configured port).
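If you prefer a scripted check, the small Go client below attempts a WebSocket handshake against the server. It is only a sketch: the `/ws` path and the `"ping"` text payload are assumptions made for illustration, not the documented protocol, so adjust them to the endpoint the server actually exposes.

```go
package main

import (
	"log"

	"github.com/gorilla/websocket" // the server itself is built on gorilla/websocket
)

func main() {
	// NOTE: the "/ws" path is an assumption for illustration; use the real endpoint.
	conn, _, err := websocket.DefaultDialer.Dial("ws://localhost:8080/ws", nil)
	if err != nil {
		log.Fatalf("dial failed (is the server running?): %v", err)
	}
	defer conn.Close()
	log.Println("WebSocket handshake succeeded")

	// Optionally send a small text frame; whether the server answers text
	// messages depends on its protocol.
	if err := conn.WriteMessage(websocket.TextMessage, []byte("ping")); err != nil {
		log.Fatalf("write failed: %v", err)
	}
}
```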
```mermaid
graph TD
    A["Frontend (Browser)"] --> B["WebSocket Server (Go Backend)"]
    B --> C["Speech-to-Text (STT) Module"]
    C --> D["Large Language Model (LLM) Module"]
    D --> E["Text-to-Speech (TTS) Module"]
    E --> F["Storage Service (e.g., MinIO)"]
    F -->|Provides Audio URL| B
    B -->|Sends Audio URL/Data| A
```
- Frontend (Browser): The user records voice input via the browser, sending audio data through a WebSocket connection to the server.
- WebSocket Server: Receives audio data from the frontend and orchestrates the workflow between different service modules.
- Speech-to-Text (STT) Module: Converts the incoming audio data into text.
- Large Language Model (LLM) Module: Processes the text from STT and generates an intelligent response.
- Text-to-Speech (TTS) Module: Converts the LLM's text response back into audio data.
- Storage Service (MinIO): Stores the generated audio files and provides accessible URLs for playback.
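Put together, one user utterance makes a single pass through four pluggable services. The sketch below illustrates that flow in Go; the interfaces and method names (`Transcribe`, `Chat`, `Synthesize`, `Save`) are hypothetical stand-ins for whatever `internal/stt`, `internal/llm`, `internal/tts`, and `internal/storage` actually define.

```go
package pipeline

import "context"

// Hypothetical stand-ins for the project's pluggable service interfaces.
type STT interface {
	Transcribe(ctx context.Context, audio []byte) (string, error)
}
type LLM interface {
	Chat(ctx context.Context, prompt string) (string, error)
}
type TTS interface {
	Synthesize(ctx context.Context, text string) ([]byte, error)
}
type Storage interface {
	Save(ctx context.Context, audio []byte) (url string, err error)
}

// HandleUtterance walks one user utterance through the STT -> LLM -> TTS ->
// storage chain and returns the URL the WebSocket server would push back
// to the browser.
func HandleUtterance(ctx context.Context, stt STT, llm LLM, tts TTS, store Storage, audio []byte) (string, error) {
	text, err := stt.Transcribe(ctx, audio) // speech -> text
	if err != nil {
		return "", err
	}
	reply, err := llm.Chat(ctx, text) // text -> LLM response
	if err != nil {
		return "", err
	}
	speech, err := tts.Synthesize(ctx, reply) // response -> audio
	if err != nil {
		return "", err
	}
	return store.Save(ctx, speech) // audio -> playable URL
}
```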
```
voiceflow/
├── cmd/
│   └── main.go        # Application entry point
├── configs/
│   ├── config.yaml    # Business logic configuration file
│   └── .env           # Environment variables file (sensitive keys, etc.)
├── internal/
│   ├── config/        # Configuration loading module
│   ├── server/        # WebSocket server implementation
│   ├── stt/           # Speech-to-Text module (interfaces, implementations)
│   ├── tts/           # Text-to-Speech module (interfaces, implementations)
│   ├── llm/           # LLM interaction module (interfaces, implementations)
│   ├── storage/       # Storage module (interfaces, implementations like MinIO)
│   ├── models/        # Data models/structs used across the application
│   └── utils/         # Utility functions
├── pkg/
│   └── logger/        # Logging module setup
├── scripts/           # Build and deployment scripts (if any)
├── go.mod             # Go modules file (dependencies)
├── go.sum             # Go modules checksum file
└── README.md          # Project description (this file)
```
- WebSocket Server
  - Implemented using `gorilla/websocket`.
  - Handles real-time communication with the frontend, receiving audio data and sending back processing results (like audio URLs).
- Speech-to-Text (STT)
  - Interface Definition: `internal/stt/stt.go` defines the standard interface for STT services.
  - Pluggable Implementations: Supports various providers like Azure, Google Cloud Speech, and potentially local models. New providers can be added by implementing the interface (see the sketch after this list).
- Text-to-Speech (TTS)
  - Interface Definition: `internal/tts/tts.go` defines the standard interface for TTS services.
  - Pluggable Implementations: Supports various providers like Azure, Google Cloud Text-to-Speech, and potentially local models.
- Large Language Model (LLM)
  - Interface Definition: `internal/llm/llm.go` defines the interface for interacting with LLMs.
  - Pluggable Implementations: Supports providers like OpenAI (GPT models) and potentially local LLMs.
- Storage Module
  - Interface Definition: `internal/storage/storage.go` defines the interface for storage services.
  - Implementation: Defaults to using MinIO for object storage (ideal for audio files) but can be adapted to use local file systems or other cloud storage providers.
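As an illustration of the pluggable design, here is a minimal sketch of how an STT interface plus a provider registry could look. The `Recognize` method, the registry, and the registration flow are hypothetical; the real contract lives in `internal/stt/stt.go` and may differ.

```go
package stt

import (
	"context"
	"fmt"
)

// Service is a hypothetical version of the STT contract; the real one is
// defined in internal/stt/stt.go.
type Service interface {
	Recognize(ctx context.Context, audio []byte) (string, error)
}

// registry maps a provider name from config.yaml (e.g. "azure", "google",
// "local") to a constructor for that provider.
var registry = map[string]func() Service{}

// Register lets each provider implementation plug itself in, typically from
// an init() function in its own file.
func Register(name string, factory func() Service) { registry[name] = factory }

// New returns the provider selected in configuration.
func New(provider string) (Service, error) {
	factory, ok := registry[provider]
	if !ok {
		return nil, fmt.Errorf("unknown stt provider %q", provider)
	}
	return factory(), nil
}
```

Under this pattern, a new provider only needs to implement `Service` and call `Register` from its own file; the same idea applies to the TTS, LLM, and storage interfaces.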
- Implement a Message Bus (e.g., Kafka, NATS) for better decoupling between services.
- Integrate a Configuration Center (e.g., Consul, etcd) for dynamic configuration management.
- Provide Containerized Deployment options (Dockerfile, docker-compose.yaml).
- Implement Hooks/Callbacks for extending functionality at various stages of the pipeline.
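Hooks/Callbacks are still a roadmap item, so nothing like the following exists in the codebase yet; it is purely an illustration of one shape the feature could take, with invented names.

```go
package hooks

import "context"

// Stage identifies a point in the voice pipeline where a hook can run.
// These names are invented for illustration only.
type Stage string

const (
	AfterSTT Stage = "after_stt"
	AfterLLM Stage = "after_llm"
	AfterTTS Stage = "after_tts"
)

// Hook receives the intermediate payload of a stage and may observe or
// transform it (e.g. logging, moderation, caching).
type Hook func(ctx context.Context, stage Stage, payload []byte) error

var hooks = map[Stage][]Hook{}

// Register attaches a hook to a pipeline stage.
func Register(stage Stage, h Hook) { hooks[stage] = append(hooks[stage], h) }

// Run invokes all hooks registered for a stage, stopping at the first error.
func Run(ctx context.Context, stage Stage, payload []byte) error {
	for _, h := range hooks[stage] {
		if err := h(ctx, stage, payload); err != nil {
			return err
		}
	}
	return nil
}
```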
We welcome contributions of any kind! Please read CONTRIBUTING.md (if available, otherwise follow standard GitHub practices) for more information.
- Reporting Issues: If you find a bug or have a feature suggestion, please submit an issue on GitHub.
- Contributing Code: Fork the repository, make your changes on a separate branch, and submit a Pull Request.
voiceflow is licensed under the Apache License 2.0.
Thank you to all the developers who have contributed to this project!