This package provides a torch-free (or torch-optional) end-to-end TTS inference pipeline for MOSS-TTS-Delay using:
- llama.cpp for the Qwen3 backbone (GGUF format, GPU/CPU)
- NumPy for embeddings, LM heads, delay state machine, and sampling
- ONNX Runtime or TensorRT for the audio tokenizer
When PyTorch is available, LM heads can optionally be GPU-accelerated (~30x faster).
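For intuition on the torch-free path: each of the 32 audio LM heads is just a matrix multiply against the backbone's hidden state, which NumPy handles directly. A toy sketch — the hidden size below is made up for illustration; only the (32, 1025) logits shape comes from this package's architecture:

```python
import numpy as np

H, V, N_HEADS = 64, 1025, 32   # hidden size H is illustrative, not the real model dim
heads = np.random.rand(N_HEADS, V, H).astype(np.float32)  # one (V, H) weight per head

def lm_head_logits(hidden: np.ndarray) -> np.ndarray:
    """(H,) hidden state -> (32, 1025) audio logits via one batched matmul."""
    return np.einsum("nvh,h->nv", heads, hidden)

print(lm_head_logits(np.ones(H, dtype=np.float32)).shape)  # (32, 1025)
```

A torch backend does the same contraction on the GPU, which is where the ~30x speedup comes from.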
Requirements:

- llama.cpp — compiled from source with shared library support
- Python >= 3.10
Install with the extras matching the backends you plan to use:

```bash
pip install -e ".[llama-cpp-onnx]"
pip install -e ".[llama-cpp-trt]"
pip install -e ".[llama-cpp-trt,llama-cpp-torch]"
```

To convert weights from the original MOSS-TTS model yourself (instead of downloading pre-quantized ones), see the conversion guide.
We provide pre-quantized GGUF backbone, embedding tables, and LM head matrices on HuggingFace:
```bash
# Download pre-built GGUF + embeddings + lm_heads
huggingface-cli download OpenMOSS-Team/MOSS-TTS-GGUF --local-dir weights/MOSS-TTS-GGUF
```

This gives you:

- `weights/MOSS-TTS-GGUF/MOSS_TTS_Q4_K_M.gguf` — Q4_K_M quantized backbone
- `weights/MOSS-TTS-GGUF/embeddings/` — 33 embedding `.npy` files
- `weights/MOSS-TTS-GGUF/lm_heads/` — 33 LM head `.npy` files
- `weights/MOSS-TTS-GGUF/tokenizer/` — BPE tokenizer files
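Since the embedding tables are plain `.npy` arrays, the lookup stage needs nothing beyond NumPy fancy indexing. A minimal sketch — the shapes and the summation across the 33 streams are illustrative assumptions, not the package's exact layout:

```python
import numpy as np

# Illustrative stand-ins for the 33 tables in weights/MOSS-TTS-GGUF/embeddings/:
# one (vocab_size, hidden) matrix per stream, with made-up sizes.
tables = [np.random.rand(1025, 64).astype(np.float32) for _ in range(33)]

def embed(input_ids: np.ndarray) -> np.ndarray:
    """Combine per-stream embeddings for an (S, 33) id matrix into (S, H)."""
    return sum(tables[k][input_ids[:, k]] for k in range(33))

ids = np.zeros((5, 33), dtype=np.int64)  # (S, 33) token ids
print(embed(ids).shape)  # (5, 64)
```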
We provide ONNX models for the audio tokenizer. TensorRT engines are not provided because they are tied to specific GPU architectures and TensorRT versions.
```bash
# Download ONNX encoder & decoder
huggingface-cli download OpenMOSS-Team/MOSS-Audio-Tokenizer-ONNX --local-dir weights/MOSS-Audio-Tokenizer-ONNX
```

```bash
# Clone and build llama.cpp (if not already done)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && cmake -B build && cmake --build build --config Release -j
cd ..
```
```bash
# Build the C bridge shared library
cd moss_tts_delay/llama_cpp
bash build_bridge.sh /path/to/llama.cpp
```

Note: the TensorRT engine build below is only needed if you want to use `audio_backend: trt` for maximum audio tokenizer performance. Most users should use the ONNX backend.

```bash
bash moss_audio_tokenizer/trt/build_engine.sh \
    weights/MOSS-Audio-Tokenizer-ONNX/encoder.onnx \
    weights/MOSS-Audio-Tokenizer-ONNX/decoder.onnx \
    weights/MOSS-Audio-Tokenizer-TRT
```
⚠️ `maxShapes` determines the maximum audio length your engine can handle. The default builds support up to 40 seconds of audio. If you need longer audio, edit `MAX_AUDIO_SECONDS` in `build_engine.sh` before building. See the detailed shape ↔ duration table in the script's comments.
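As a sanity check when picking `MAX_AUDIO_SECONDS`, the waveform side of the shape ↔ duration relationship is simple arithmetic at the 24 kHz rate this pipeline writes (the token-side frame counts depend on the tokenizer and are documented in the script itself):

```python
SAMPLE_RATE = 24_000  # the pipeline writes 24 kHz wavs

def max_waveform_samples(seconds: float) -> int:
    """Upper bound on decoder output length for a given engine max shape."""
    return int(seconds * SAMPLE_RATE)

print(max_waveform_samples(40))  # default 40 s build -> 960000 samples
```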
```bash
# Basic generation
python -m moss_tts_delay.llama_cpp \
    --config configs/llama_cpp/default.yaml \
    --text "Hello, world!" \
    --output output.wav

# With reference audio (voice cloning)
python -m moss_tts_delay.llama_cpp \
    --config configs/llama_cpp/default.yaml \
    --text "Hello!" \
    --reference ref.wav \
    --output output.wav

# Force numpy LM heads (torch-free)
python -m moss_tts_delay.llama_cpp \
    --config configs/llama_cpp/default.yaml \
    --text "Hello!" \
    --heads-backend numpy

# With profiling
python -m moss_tts_delay.llama_cpp \
    --config configs/llama_cpp/default.yaml \
    --text "Hello!" \
    --profile
```

```python
from moss_tts_delay.llama_cpp import LlamaCppPipeline, PipelineConfig

config = PipelineConfig.from_yaml("configs/llama_cpp/default.yaml")

with LlamaCppPipeline(config) as pipeline:
    waveform = pipeline.generate(
        text="Hello, world!",
        reference_audio="ref.wav",  # optional
        language="en",
    )

import soundfile as sf
sf.write("output.wav", waveform, 24000)
```

```bash
python scripts/batch_eval_llama_cpp.py \
    --config configs/llama_cpp/default.yaml \
    --benchmark-dir /path/to/eval/tts \
    --result-dir results/llama_cpp_run \
    --suite seed-tts
```

Quantization quality is evaluated on the Seed-TTS-eval zero-shot benchmark. The baseline is the original HuggingFace model; GGUF variants use the llama.cpp backend with the TensorRT audio tokenizer.
| Quantization | EN WER (%) ↓ | EN SIM (%) ↑ | ZH CER (%) ↓ | ZH SIM (%) ↑ |
|---|---|---|---|---|
| Baseline (HuggingFace) | 1.79 | 71.46 | 1.32 | 77.05 |
| Q8_0 | 3.21 | 68.61 | 1.56 | 76.03 |
| Q6_K | 3.11 | 68.77 | 1.44 | 76.06 |
| Q5_K_M | 2.95 | 68.55 | 1.50 | 75.96 |
| Q4_K_M | 2.83 | 68.15 | 1.58 | 75.71 |
| Config | Audio Backend | Use Case |
|---|---|---|
| `configs/llama_cpp/default.yaml` | ONNX | Recommended starting point |
| `configs/llama_cpp/trt.yaml` | TensorRT | Maximum throughput |
| `configs/llama_cpp/cpu-only.yaml` | ONNX (CPU) | No GPU required |
| Option | Values | Description |
|---|---|---|
| `heads_backend` | `auto` / `numpy` / `torch` | LM heads computation backend; `auto` uses torch if available |
| `audio_backend` | `onnx` / `trt` / `torch` | Audio tokenizer backend |
| `n_gpu_layers` | `-1` / `0` / `N` | GPU offload layers; `-1` = all, `0` = CPU only |
| `n_ctx` | int | Context window size (prompt + generation) |
| `max_new_tokens` | int | Maximum generation steps |
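Putting these options together, an override config might look like the following — a hypothetical sketch: the field names come from the table above, but the actual YAML files in `configs/llama_cpp/` may nest or name them differently.

```yaml
# Hypothetical config sketch; compare against configs/llama_cpp/default.yaml
heads_backend: auto      # torch LM heads if installed, else numpy
audio_backend: onnx      # onnx / trt / torch
n_gpu_layers: -1         # offload all backbone layers to GPU
n_ctx: 8192              # prompt + generation budget
max_new_tokens: 2048     # hard cap on generation steps
```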
```
Input text
   │
   ▼
Tokenizer (Rust BPE, tokenizers library)
   │
   ▼
build_generation_prompt() → input_ids (S, 33)
   │
   ▼
EmbeddingLookup (NumPy .npy) → embeddings (S, H)
   │
   ▼
LlamaCppBackbone (GGUF, C bridge) → hidden_state (H,)
   │
   ├─ [heads_backend=torch] TorchLMHeads (nn.Linear, GPU)
   │      └─ audio_logits (32, 1025)
   │
   └─ [heads_backend=numpy] NumpyLMHeads (CPU matmul)
          └─ audio_logits (32, 1025)
   │
   ▼
delay_step() + sampling (NumPy) → next_ids (33,)
   │
   ▼ (loop until EOS)
   │
Audio codes → AudioTokenizer (ONNX/TRT/Torch) → waveform
```
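The per-step sampling is pure NumPy. The following is a simplified top-k + top-p sketch in that spirit, not the package's actual `sampling.py` code:

```python
import numpy as np

def sample_token(logits, top_k=50, top_p=0.95, temperature=1.0, rng=None):
    """Sample one id from a (V,) logit vector with top-k then top-p filtering."""
    rng = rng or np.random.default_rng()
    logits = logits / temperature
    # top-k: mask everything below the k-th largest logit (ties survive)
    kth = np.sort(logits)[-top_k]
    logits = np.where(logits < kth, -np.inf, logits)
    # softmax over the surviving logits
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # top-p: keep the smallest prefix of tokens whose cumulative mass >= top_p
    order = np.argsort(probs)[::-1]
    cdf = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cdf, top_p) + 1]
    p = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=p))

logits = np.zeros(1025, dtype=np.float32)
logits[7] = 10.0  # heavily favor id 7
print(sample_token(logits))  # → 7 (the dominant logit survives both filters)
```

In the real pipeline this runs once per audio head per step, producing the `next_ids (33,)` vector shown in the diagram.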
```
moss_tts_delay/llama_cpp/
├── __init__.py            # Package entry, exports LlamaCppPipeline
├── __main__.py            # python -m moss_tts_delay.llama_cpp
├── _constants.py          # Token IDs (from config.json, torch-free)
├── pipeline.py            # LlamaCppPipeline (main entry)
├── backbone.py            # LlamaCppBackbone (C bridge wrapper)
├── backbone_bridge.c      # C bridge source
├── build_bridge.sh        # Build script
├── embedding.py           # EmbeddingLookup (NumPy)
├── lm_heads.py            # NumpyLMHeads + TorchLMHeads
├── delay_state.py         # Delay state machine (NumPy)
├── sampling.py            # top-k/p sampling (NumPy)
├── processor.py           # Tokenizer + prompt builder
├── README.md              # This file
├── README_zh.md           # Chinese documentation
└── conversion/
    ├── extract_weights.py # Weight extraction script
    ├── README.md          # Conversion guide (English)
    └── README_zh.md       # Conversion guide (Chinese)
```