yapit: Listen to anything. Open-source TTS for documents, web pages, and text.

License: AGPL-3.0


Paste a URL or upload a PDF. Yapit renders the document and reads it aloud.

  • Handles the documents other TTS tools can't: academic papers with math, citations, figures, tables, messy formatting. Equations get spoken descriptions, citations become prose, page noise is skipped. The original content displays faithfully.
  • 170+ voices across 15 languages. Premium voices or free local synthesis that runs entirely in your browser, no account needed.
  • Vim-style keyboard shortcuts, document outliner, media key support, adjustable speed, dark mode, share by link.
  • Markdown export: append /md to any document URL to get clean markdown via curl. /md-annotated includes TTS annotations.
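
For example, fetching both flavors with curl (the document URL here is a placeholder; use one from your own instance):

```shell
# Placeholder URL; substitute a document URL from your own instance
DOC_URL="http://localhost/doc/example-paper"
# Clean markdown (|| true keeps going if the server isn't reachable)
curl -s "$DOC_URL/md" -o paper.md || true
# Markdown with TTS annotations
curl -s "$DOC_URL/md-annotated" -o paper.tts.md || true
```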

Powered by Gemini, Kokoro, Inworld TTS, DocLayout-YOLO, defuddle.

Self-hosting

git clone --depth 1 https://github.com/yapit-tts/yapit.git && cd yapit
cp .env.selfhost.example .env.selfhost # edit to enable optional features (AI-extraction, custom TTS models)
make self-host

Open http://localhost. Data persists across restarts. To stop: make self-host-down.

Multi-user mode

By default, yapit runs in single-user mode: no login required, all features unlocked. .env.selfhost is self-documenting; see its comments for the optional features (AI extraction, custom TTS models).

If you want user accounts with login (e.g., for a family or small team), set AUTH_ENABLED=true in .env.selfhost, uncomment the Stack Auth section below it, and use make self-host-auth instead. This adds Stack Auth and ClickHouse containers. Note: in single-user mode, all requests share one user, so everyone on the network sees the same document library.
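
Concretely, the switch is one line in .env.selfhost (the Stack Auth variables themselves live in the commented-out section the file already contains):

```shell
AUTH_ENABLED=true
```

Then bring the stack up with make self-host-auth instead of make self-host.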

Custom TTS voices

Use any server implementing the OpenAI /v1/audio/speech API (vLLM-Omni, Kokoro-FastAPI, AllTalk, Chatterbox TTS, etc.).

Add to .env.selfhost:

OPENAI_TTS_BASE_URL=http://your-tts-server:8091/v1
OPENAI_TTS_API_KEY=your-key-or-empty
OPENAI_TTS_MODEL=your-model-name

Voices are auto-discovered if the server supports GET /v1/audio/voices. Otherwise set OPENAI_TTS_VOICES=voice1,voice2,....
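
A quick way to check which case you're in (the host is a placeholder):

```shell
# Placeholder host; point this at your TTS server
TTS_BASE="http://your-tts-server:8091/v1"
# A JSON voice list means auto-discovery works; a 404 means you need to
# set OPENAI_TTS_VOICES manually (|| true: ignore unreachable hosts)
curl -s "$TTS_BASE/audio/voices" || true
```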

Example: OpenAI TTS

OpenAI doesn't support voice auto-discovery, so OPENAI_TTS_VOICES is required.

OPENAI_TTS_BASE_URL=https://api.openai.com/v1
OPENAI_TTS_API_KEY=sk-...
OPENAI_TTS_MODEL=tts-1
OPENAI_TTS_VOICES=alloy,echo,fable,nova,onyx,shimmer
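
With those values exported, you can smoke-test the key against the same endpoint yapit calls (a sketch; the output filename is arbitrary):

```shell
# Skipped unless OPENAI_TTS_API_KEY is set in the environment
if [ -n "$OPENAI_TTS_API_KEY" ]; then
  curl -s https://api.openai.com/v1/audio/speech \
    -H "Authorization: Bearer $OPENAI_TTS_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"model": "tts-1", "voice": "alloy", "input": "Hello from yapit"}' \
    -o hello.mp3
fi
```

A playable hello.mp3 means the credentials and model name are good.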

Example: Qwen3-TTS via vLLM-Omni

Requires GPU. The default stage config assumes >=16GB VRAM. For 8GB cards (e.g., RTX 3070 Ti), create a custom config with lower sequence lengths and memory utilization — see the stage config reference.

pip install vllm-omni
vllm-omni serve Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice \
    --omni --port 8091 --trust-remote-code --enforce-eager \
    --stage-configs-path /path/to/stage_configs.yaml # if you have low VRAM. `max_model_len: 1024` should work on 8GB

Then configure yapit:

OPENAI_TTS_BASE_URL=http://your-gpu-host:8091/v1
OPENAI_TTS_API_KEY=EMPTY
OPENAI_TTS_MODEL=Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice

Voices are auto-discovered from the server (9 built-in speakers for CustomVoice models).

AI document extraction

Vision-based PDF/image processing works with any OpenAI-compatible API.

Add to .env.selfhost:

AI_PROCESSOR=openai
AI_PROCESSOR_BASE_URL=https://openrouter.ai/api/v1  # or your vLLM/Ollama endpoint
AI_PROCESSOR_API_KEY=your-key
AI_PROCESSOR_MODEL=qwen/qwen3-vl-235b-a22b-instruct  # any vision-capable model

Or use Google Gemini directly (with batch-mode support): AI_PROCESSOR=gemini + GOOGLE_API_KEY=your-key.
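
In .env.selfhost, the Gemini variant is just:

```shell
AI_PROCESSOR=gemini
GOOGLE_API_KEY=your-key
```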

GPU workers for Kokoro TTS & YOLO figure detection

Kokoro and YOLO run as pull-based workers — any machine with Redis access can join. Connect from the local network or via Tailscale. GPU and CPU workers run side-by-side; faster workers naturally pull more jobs. Scale by running more containers on any machine that can reach Redis.

Prereq: Docker 25+, nvidia-container-toolkit with CDI enabled, network access to the Redis instance.

# One-time GPU setup: generate CDI spec + enable CDI in Docker
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
# Add {"features": {"cdi": true}} to /etc/docker/daemon.json, then:
sudo systemctl restart docker

git clone --depth 1 https://github.com/yapit-tts/yapit.git && cd yapit

# Pull only the images you need
docker compose -f docker-compose.worker.yml pull kokoro-gpu yolo-gpu

# Start 2 Kokoro + 1 YOLO worker
REDIS_URL=redis://<host>:6379/0 docker compose -f docker-compose.worker.yml up -d \
  --scale kokoro-gpu=2 --scale yolo-gpu=1 kokoro-gpu yolo-gpu

Adjust --scale to your GPU. A 4GB card fits 2 Kokoro + 1 YOLO comfortably.

NVIDIA MPS (recommended for multiple workers per GPU)

MPS lets multiple workers share one GPU context — less VRAM overhead, no context switching. Without MPS, each worker gets its own CUDA context (~300MB each). The compose file mounts the MPS pipe automatically; just start the daemon.

sudo tee /etc/systemd/system/nvidia-mps.service > /dev/null <<'EOF'
[Unit]
Description=NVIDIA Multi-Process Service (MPS)
After=nvidia-persistenced.service

[Service]
Type=forking
ExecStart=/usr/bin/nvidia-cuda-mps-control -d
ExecStop=/bin/sh -c 'echo quit | /usr/bin/nvidia-cuda-mps-control'
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now nvidia-mps
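
To confirm the daemon came up (a sketch; needs procps for pgrep, and is only meaningful on the NVIDIA host itself):

```shell
# Look for the MPS control daemon in the process table
if pgrep -f nvidia-cuda-mps-control >/dev/null 2>&1; then
  MPS_STATUS="running"
else
  MPS_STATUS="not running"
fi
echo "MPS control daemon: $MPS_STATUS"
```

When MPS is active, worker processes should appear in nvidia-smi routed through a single nvidia-cuda-mps-server process.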

Roadmap

Next:

  • Support exporting audio as MP3.
  • Support word-level highlighting for Kokoro English.

Later:

  • Support the thinking parameter for Gemini.
  • Support the temperature parameter for Inworld.
  • Support AI-transform for websites.

Development

uv sync                              # install Python dependencies
npm install --prefix frontend        # install frontend dependencies
make dev-env 2>/dev/null || touch .env  # decrypt secrets, or create empty .env
make dev-cpu                         # start backend services (Docker Compose)
cd frontend && npm run dev           # start frontend
make test-local                      # run tests

See agent/knowledge/dev-setup.md for full setup instructions.

The agent/knowledge/ directory is the project's in-depth knowledge base, maintained jointly with Claude during development.