Kokoro Style Optimization

This project implements Style Optimization for the Kokoro-82M TTS model. It allows you to generate speech with specific emotional tones (e.g., anger, happiness) by optimizing the style vector during inference.

Features

  • Emotion Steering: Generate speech that matches a target emotion embedding.
  • Optimization Method: Uses Particle Swarm Optimization (PSO) to search for the best style vector.
  • Dual Output: Generates both a baseline (neutral) version and the steered (emotional) version for comparison.

Installation

  1. git clone https://github.com/eryawww/kokoro_hack.git

  2. cd kokoro_hack

  3. pip install -r requirements.txt

Usage

The main entry point is main.py.

Basic Usage

Generate audio with a specific emotion:

python main.py --text "I am very angry right now!" --emotion anger

This will output two files:

  1. I am very angry right now!.wav (Baseline, Zero-shot)
  2. I am very angry right now!_anger.wav (Steered)

Advanced Options

python main.py \
  --text "I am feeling very sleepy." \
  --emotion sleepiness \
  --iters 100 \
  --early_stopping 15 \
  --stft_loss_weight 0.5

Arguments

| Argument | Description | Default |
| --text | Text to speak (required). Also used as the output filename prefix. | - |
| --emotion | Target emotion. Supported: amused, anger, disgust, neutral, sleepiness. | - |
| --iters | Maximum number of PSO iterations. | 80 |
| --early_stopping | Stop optimization if there is no improvement for N consecutive iterations. | 10 |
| --stft_loss_weight | Weight of the STFT loss component relative to the cosine-similarity component. Higher values keep the audio closer to the baseline (more realistic) at the cost of a weaker emotion match. | 0.7 |
| --embedding_path | Path to the emotion embeddings file. | per_emotion_embedding_centroid.pt |
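
As a rough illustration of how the arguments above fit together, here is a minimal argparse sketch. It is not taken from main.py: the function names build_parser and output_names are hypothetical, and treating --emotion as required is an assumption (the table lists no default for it). Only the flag names, defaults, supported emotions, and output filename pattern come from this README.

```python
import argparse

# Emotions listed as supported in the arguments table.
EMOTIONS = ["amused", "anger", "disgust", "neutral", "sleepiness"]


def build_parser():
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="Illustrative CLI sketch")
    parser.add_argument("--text", required=True, help="Text to speak")
    # Assumption: --emotion is required, since no default is documented.
    parser.add_argument("--emotion", required=True, choices=EMOTIONS)
    parser.add_argument("--iters", type=int, default=80)
    parser.add_argument("--early_stopping", type=int, default=10)
    parser.add_argument("--stft_loss_weight", type=float, default=0.7)
    parser.add_argument(
        "--embedding_path", default="per_emotion_embedding_centroid.pt"
    )
    return parser


def output_names(text, emotion):
    """Baseline and steered filenames, following the pattern in Basic Usage."""
    return f"{text}.wav", f"{text}_{emotion}.wav"


# Parse an explicit argument list so the sketch runs without a command line.
args = build_parser().parse_args(
    ["--text", "I am very angry right now!", "--emotion", "anger"]
)
names = output_names(args.text, args.emotion)
```

Because --text doubles as the filename prefix, punctuation in the text ends up in the output filenames unchanged.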

How It Works

  1. Emotion Encoder: Uses a pre-trained emotion encoder (Wav2Vec2-based) to extract embeddings from audio.
  2. Style Optimization:
    • The Kokoro model accepts a style vector.
    • PSO maintains a swarm of particles (style vectors) that explore the style space.
    • It minimizes a loss function that combines a cosine-similarity term (rewarding closeness to the target emotion centroid) with a Multi-Resolution STFT Loss (preserving audio quality relative to the baseline).
  3. Output: Produces the final audio using the best found style vector.
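
The optimization loop described above can be sketched with NumPy alone. This is a toy illustration, not the project's code. The style vector is scored directly instead of being passed through Kokoro and the emotion encoder, and a mean squared distance to the baseline stands in for the Multi-Resolution STFT loss. The convex weighting of the two terms, the swarm size, and the PSO coefficients are all assumptions. The names cosine_distance, combined_loss, and pso are hypothetical.

```python
import numpy as np


def cosine_distance(a, b):
    """1 - cosine similarity: 0 when both vectors point the same way."""
    denom = np.linalg.norm(a) * np.linalg.norm(b) + 1e-12
    return 1.0 - float(np.dot(a, b) / denom)


def combined_loss(style, target, baseline, stft_weight=0.7):
    # Assumption: the two terms are blended as a convex combination.
    emotion_term = cosine_distance(style, target)
    # Placeholder for the STFT loss: squared distance to the baseline vector.
    quality_term = float(np.mean((style - baseline) ** 2))
    return (1.0 - stft_weight) * emotion_term + stft_weight * quality_term


def pso(loss_fn, init, iters=80, early_stopping=10, n_particles=16, seed=0):
    """Basic global-best PSO with patience-based early stopping."""
    rng = np.random.default_rng(seed)
    pos = init + 0.1 * rng.standard_normal((n_particles, init.shape[0]))
    pos[0] = init  # keep the unperturbed starting point in the swarm
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_loss = np.array([loss_fn(p) for p in pos])
    g = int(np.argmin(pbest_loss))
    gbest, gbest_loss = pbest[g].copy(), float(pbest_loss[g])
    stale = 0
    for _ in range(iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        # Inertia plus pulls toward each particle's best and the swarm's best.
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
        pos = pos + vel
        losses = np.array([loss_fn(p) for p in pos])
        improved = losses < pbest_loss
        pbest[improved] = pos[improved]
        pbest_loss[improved] = losses[improved]
        g = int(np.argmin(pbest_loss))
        if pbest_loss[g] < gbest_loss - 1e-9:
            gbest, gbest_loss = pbest[g].copy(), float(pbest_loss[g])
            stale = 0
        else:
            stale += 1
            if stale >= early_stopping:
                break  # no improvement for `early_stopping` iterations
    return gbest, gbest_loss


# Toy data: an 8-dimensional "style space" with a made-up target direction.
baseline = np.ones(8)
target = np.array([1.0, -1.0, 2.0, 0.5, 0.0, 1.5, -0.5, 1.0])
loss_fn = lambda s: combined_loss(s, target, baseline, stft_weight=0.7)
baseline_loss = loss_fn(baseline)
best, best_loss = pso(loss_fn, baseline)
```

Since the baseline vector is one of the initial particles, the returned loss can never be worse than the baseline's. Raising stft_weight in this sketch pulls the result back toward the baseline, which mirrors the trade-off described for --stft_loss_weight.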

Acknowledgements

  • Based on Kokoro-82M by hexgrad.
  • StyleTTS 2 architecture.
