This project implements Style Optimization for the Kokoro-82M TTS model. It allows you to generate speech with specific emotional tones (e.g., anger, happiness) by optimizing the style vector during inference.
- Emotion Steering: Generate speech that matches a target emotion embedding.
- Optimization Method: Uses Particle Swarm Optimization (PSO) to search for the best style vector.
- Dual Output: Generates both a baseline (neutral) version and the steered (emotional) version for comparison.
- Clone the repository: `git clone https://github.com/eryawww/kokoro_hack.git`
- Install the dependencies: `pip install -r requirements.txt`
The main entry point is `main.py`.
Generate audio with a specific emotion. This will output two files:
- `I am very angry right now!.wav` (Baseline, Zero-shot)
- `I am very angry right now!_anger.wav` (Steered)
Basic usage:

    python main.py --text "I am very angry right now!" --emotion anger

Advanced usage, overriding the optimization settings:

    python main.py \
      --text "I am feeling very sleepy." \
      --emotion sleepiness \
      --iters 100 \
      --early_stopping 15 \
      --stft_loss_weight 0.5

| Argument | Description | Default |
|---|---|---|
| `--text` | Text to speak (required). Used as the filename prefix. | - |
| `--emotion` | Target emotion. Supported: `amused`, `anger`, `disgust`, `neutral`, `sleepiness`. | - |
| `--iters` | Number of PSO iterations. | 80 |
| `--early_stopping` | Stop optimization if no improvement for N iterations. | 10 |
| `--stft_loss_weight` | Weight of the STFT loss component (vs Cosine similarity). Higher for more realistic audio. | 0.7 |
| `--embedding_path` | Path to emotion embeddings file. | `per_emotion_embedding_centroid.pt` |
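
For orientation, here is a minimal sketch of how the arguments in the table could be declared with `argparse`. It is not copied from `main.py`: the function names `build_parser` and `output_paths` are hypothetical, and restricting `--emotion` with `choices` is an assumption. Only the flag names, defaults, and output filename pattern come from this README.

```python
import argparse

# Emotions listed as supported in the argument table.
EMOTIONS = ["amused", "anger", "disgust", "neutral", "sleepiness"]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the argument table (not the real main.py)."""
    p = argparse.ArgumentParser(description="Emotion-steered Kokoro-82M synthesis")
    p.add_argument("--text", required=True,
                   help="Text to speak; also used as the filename prefix")
    # Assumption: the real script may validate the emotion differently.
    p.add_argument("--emotion", choices=EMOTIONS, help="Target emotion")
    p.add_argument("--iters", type=int, default=80,
                   help="Number of PSO iterations")
    p.add_argument("--early_stopping", type=int, default=10,
                   help="Stop if no improvement for N iterations")
    p.add_argument("--stft_loss_weight", type=float, default=0.7,
                   help="Weight of the STFT loss component")
    p.add_argument("--embedding_path", default="per_emotion_embedding_centroid.pt",
                   help="Path to the emotion embeddings file")
    return p


def output_paths(text: str, emotion: str) -> tuple[str, str]:
    """Baseline and steered filenames, following the naming shown earlier."""
    return f"{text}.wav", f"{text}_{emotion}.wav"


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["--text", "I am very angry right now!", "--emotion", "anger"]
)
baseline_path, steered_path = output_paths(args.text, args.emotion)
print(baseline_path, steered_path, args.iters, args.stft_loss_weight)
```

Flags that are not passed fall back to the defaults from the table, so the basic invocation runs 80 PSO iterations with a patience of 10.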
- Emotion Encoder: Uses a pre-trained emotion encoder (Wav2Vec2-based) to extract embeddings from audio.
- Style Optimization:
  - The Kokoro model accepts a `style` vector.
  - PSO maintains a swarm of particles (candidate style vectors) that explore the style space.
  - It minimizes a loss function that combines a cosine-similarity term (pulling the output toward the target emotion centroid) with a Multi-Resolution STFT Loss (preserving audio quality relative to the baseline).
- Output: Produces the final audio using the best style vector found.
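
The optimization loop described above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the project's implementation. The loss works directly on vectors instead of synthesized audio, a mean squared distance to the baseline stands in for the multi-resolution STFT loss, the weighting `(1 - w) * emotion + w * quality` is a guess at how `--stft_loss_weight` is applied, and the PSO coefficients (0.7 inertia, 1.5 for both attraction terms) are common textbook values. The names `toy_loss` and `pso` are hypothetical.

```python
import math
import random


def cosine_distance(a, b):
    """1 - cosine similarity, so smaller means closer to the target."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm + 1e-12)


def toy_loss(style, target, baseline, stft_weight=0.7):
    """Stand-in loss: the real pipeline would synthesize audio from `style`."""
    emotion_term = cosine_distance(style, target)
    # Placeholder for the multi-resolution STFT loss against the baseline audio.
    quality_term = sum((s - b) ** 2 for s, b in zip(style, baseline)) / len(style)
    # Assumed weighting; the actual combination in the project may differ.
    return (1.0 - stft_weight) * emotion_term + stft_weight * quality_term


def pso(loss_fn, init, iters=80, patience=10, n_particles=16, seed=0):
    """Minimal global-best PSO with patience-based early stopping."""
    rng = random.Random(seed)
    dim = len(init)
    # Particle 0 starts exactly at `init`, so the result is never worse than it.
    pos = [list(init)] + [[x + rng.gauss(0.0, 0.1) for x in init]
                          for _ in range(n_particles - 1)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_loss = [loss_fn(p) for p in pos]
    g = min(range(n_particles), key=pbest_loss.__getitem__)
    gbest, gbest_loss = pbest[g][:], pbest_loss[g]
    stale = 0  # iterations since the global best last improved
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (0.7 * vel[i][d]
                             + 1.5 * r1 * (pbest[i][d] - pos[i][d])
                             + 1.5 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            loss = loss_fn(pos[i])
            if loss < pbest_loss[i]:
                pbest[i], pbest_loss[i] = pos[i][:], loss
        g = min(range(n_particles), key=pbest_loss.__getitem__)
        if pbest_loss[g] < gbest_loss - 1e-9:
            gbest, gbest_loss = pbest[g][:], pbest_loss[g]
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break  # no improvement for `patience` iterations
    return gbest, gbest_loss


# Toy data: an 8-dimensional "baseline style" and "target emotion centroid".
data_rng = random.Random(1)
baseline = [data_rng.gauss(0.0, 1.0) for _ in range(8)]
target = [data_rng.gauss(0.0, 1.0) for _ in range(8)]

start_loss = toy_loss(baseline, target, baseline)
best, best_loss = pso(lambda s: toy_loss(s, target, baseline), baseline)
print(f"loss: {start_loss:.4f} -> {best_loss:.4f}")
```

In this sketch, raising `stft_weight` keeps the result closer to the baseline vector, while lowering it lets the swarm move further toward the target. That mirrors the trade-off described for `--stft_loss_weight`.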
- Based on Kokoro-82M by hexgrad.
- StyleTTS 2 architecture.