This package provides a torch-free (or torch-optional) end-to-end TTS inference pipeline for MOSS-TTS-Delay using:
- llama.cpp for the Qwen3 backbone (GGUF format, GPU/CPU)
- NumPy for embeddings, LM heads, delay state machine, and sampling
- ONNX Runtime or TensorRT for the audio tokenizer
When PyTorch is available, LM heads can optionally be GPU-accelerated (~30x faster).
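For intuition on the torch-free path: each of the 32 audio LM heads is just a matrix multiply against the backbone's hidden state, which NumPy handles directly. A toy sketch — the hidden size below is made up for illustration; only the (32, 1025) logits shape comes from this package's architecture:

```python
import numpy as np

H, V, N_HEADS = 64, 1025, 32   # hidden size H is illustrative, not the real model dim
heads = np.random.rand(N_HEADS, V, H).astype(np.float32)  # one (V, H) weight per head

def lm_head_logits(hidden: np.ndarray) -> np.ndarray:
    """(H,) hidden state -> (32, 1025) audio logits via one batched matmul."""
    return np.einsum("nvh,h->nv", heads, hidden)

print(lm_head_logits(np.ones(H, dtype=np.float32)).shape)  # (32, 1025)
```

A torch backend does the same contraction on the GPU, which is where the ~30x speedup comes from.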
Requirements:

- llama.cpp — compiled from source with shared library support
- Python >= 3.10
Install with the extras matching the backends you plan to use:

```bash
pip install -e ".[llama-cpp-onnx]"
pip install -e ".[llama-cpp-trt]"
pip install -e ".[llama-cpp-trt,llama-cpp-torch]"
```

To convert weights from the original MOSS-TTS model yourself (instead of downloading pre-quantized ones), see the conversion guide.
We provide pre-quantized GGUF backbone, embedding tables, and LM head matrices on HuggingFace:
```bash
# Download pre-built GGUF + embeddings + lm_heads
huggingface-cli download OpenMOSS-Team/MOSS-TTS-GGUF --local-dir weights/MOSS-TTS-GGUF
```

This gives you:

- `weights/MOSS-TTS-GGUF/MOSS_TTS_Q4_K_M.gguf` — Q4_K_M quantized backbone
- `weights/MOSS-TTS-GGUF/embeddings/` — 33 embedding `.npy` files
- `weights/MOSS-TTS-GGUF/lm_heads/` — 33 LM head `.npy` files
- `weights/MOSS-TTS-GGUF/tokenizer/` — BPE tokenizer files
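Since the embedding tables are plain `.npy` arrays, the lookup stage needs nothing beyond NumPy fancy indexing. A minimal sketch — the shapes and the summation across the 33 streams are illustrative assumptions, not the package's exact layout:

```python
import numpy as np

# Illustrative stand-ins for the 33 tables in weights/MOSS-TTS-GGUF/embeddings/:
# one (vocab_size, hidden) matrix per stream, with made-up sizes.
tables = [np.random.rand(1025, 64).astype(np.float32) for _ in range(33)]

def embed(input_ids: np.ndarray) -> np.ndarray:
    """Combine per-stream embeddings for an (S, 33) id matrix into (S, H)."""
    return sum(tables[k][input_ids[:, k]] for k in range(33))

ids = np.zeros((5, 33), dtype=np.int64)  # (S, 33) token ids
print(embed(ids).shape)  # (5, 64)
```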
We provide ONNX models for the audio tokenizer. TensorRT engines are not provided because they are tied to specific GPU architectures and TensorRT versions.
```bash
# Download ONNX encoder & decoder
huggingface-cli download OpenMOSS-Team/MOSS-Audio-Tokenizer-ONNX --local-dir weights/MOSS-Audio-Tokenizer-ONNX
```

```bash
# Clone and build llama.cpp (if not already done)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && cmake -B build && cmake --build build --config Release -j
cd ..
```
```bash
# Build the C bridge shared library
cd moss_tts_delay/llama_cpp
bash build_bridge.sh /path/to/llama.cpp
```

Note: the TensorRT engine build below is only needed if you want to use `audio_backend: trt` for maximum audio tokenizer performance. Most users should use the ONNX backend.

```bash
bash moss_audio_tokenizer/trt/build_engine.sh \
    weights/MOSS-Audio-Tokenizer-ONNX/encoder.onnx \
    weights/MOSS-Audio-Tokenizer-ONNX/decoder.onnx \
    weights/MOSS-Audio-Tokenizer-TRT
```
⚠️ `maxShapes` determines the maximum audio length your engine can handle. The default builds support up to 40 seconds of audio. If you need longer audio, edit `MAX_AUDIO_SECONDS` in `build_engine.sh` before building. See the detailed shape ↔ duration table in the script's comments.
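As a sanity check when picking `MAX_AUDIO_SECONDS`, the waveform side of the shape ↔ duration relationship is simple arithmetic at the 24 kHz rate this pipeline writes (the token-side frame counts depend on the tokenizer and are documented in the script itself):

```python
SAMPLE_RATE = 24_000  # the pipeline writes 24 kHz wavs

def max_waveform_samples(seconds: float) -> int:
    """Upper bound on decoder output length for a given engine max shape."""
    return int(seconds * SAMPLE_RATE)

print(max_waveform_samples(40))  # default 40 s build -> 960000 samples
```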
```bash
# Basic generation
python -m moss_tts_delay.llama_cpp \
    --config configs/llama_cpp/default.yaml \
    --text "Hello, world!" \
    --output output.wav

# With reference audio (voice cloning)
python -m moss_tts_delay.llama_cpp \
    --config configs/llama_cpp/default.yaml \
    --text "Hello!" \
    --reference ref.wav \
    --output output.wav

# Force numpy LM heads (torch-free)
python -m moss_tts_delay.llama_cpp \
    --config configs/llama_cpp/default.yaml \
    --text "Hello!" \
    --heads-backend numpy

# With profiling
python -m moss_tts_delay.llama_cpp \
    --config configs/llama_cpp/default.yaml \
    --text "Hello!" \
    --profile
```

```python
from moss_tts_delay.llama_cpp import LlamaCppPipeline, PipelineConfig

config = PipelineConfig.from_yaml("configs/llama_cpp/default.yaml")

with LlamaCppPipeline(config) as pipeline:
    waveform = pipeline.generate(
        text="Hello, world!",
        reference_audio="ref.wav",  # optional
        language="en",
    )

import soundfile as sf
sf.write("output.wav", waveform, 24000)
```

```bash
python scripts/batch_eval_llama_cpp.py \
    --config configs/llama_cpp/default.yaml \
    --benchmark-dir /path/to/eval/tts \
    --result-dir results/llama_cpp_run \
    --suite seed-tts
```

Quantization quality is evaluated on the Seed-TTS-eval zero-shot benchmark. The baseline is the original HuggingFace model; GGUF variants use the llama.cpp backend with the TensorRT audio tokenizer.
| Quantization | EN WER (%) ↓ | EN SIM (%) ↑ | ZH CER (%) ↓ | ZH SIM (%) ↑ |
|---|---|---|---|---|
| Baseline (HuggingFace) | 1.79 | 71.46 | 1.32 | 77.05 |
| Q8_0 | 3.21 | 68.61 | 1.56 | 76.03 |
| Q6_K | 3.11 | 68.77 | 1.44 | 76.06 |
| Q5_K_M | 2.95 | 68.55 | 1.50 | 75.96 |
| Q4_K_M | 2.83 | 68.15 | 1.58 | 75.71 |
| Config | Audio Backend | Use Case |
|---|---|---|
| `configs/llama_cpp/default.yaml` | ONNX | Recommended starting point |
| `configs/llama_cpp/trt.yaml` | TensorRT | Maximum throughput |
| `configs/llama_cpp/cpu-only.yaml` | ONNX (CPU) | No GPU required |
| Option | Values | Description |
|---|---|---|
| `heads_backend` | `auto` / `numpy` / `torch` | LM heads computation backend; `auto` uses torch if available |
| `audio_backend` | `onnx` / `trt` / `torch` | Audio tokenizer backend |
| `n_gpu_layers` | `-1` / `0` / `N` | GPU offload layers; `-1` = all, `0` = CPU only |
| `n_ctx` | int | Context window size (prompt + generation) |
| `max_new_tokens` | int | Maximum generation steps |
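Putting these options together, an override config might look like the following — a hypothetical sketch: the field names come from the table above, but the actual YAML files in `configs/llama_cpp/` may nest or name them differently.

```yaml
# Hypothetical config sketch; compare against configs/llama_cpp/default.yaml
heads_backend: auto      # torch LM heads if installed, else numpy
audio_backend: onnx      # onnx / trt / torch
n_gpu_layers: -1         # offload all backbone layers to GPU
n_ctx: 8192              # prompt + generation budget
max_new_tokens: 2048     # hard cap on generation steps
```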
```
Input text
   │
   ▼
Tokenizer (Rust BPE, tokenizers library)
   │
   ▼
build_generation_prompt() → input_ids (S, 33)
   │
   ▼
EmbeddingLookup (NumPy .npy) → embeddings (S, H)
   │
   ▼
LlamaCppBackbone (GGUF, C bridge) → hidden_state (H,)
   │
   ├─ [heads_backend=torch] TorchLMHeads (nn.Linear, GPU)
   │      └─ audio_logits (32, 1025)
   │
   └─ [heads_backend=numpy] NumpyLMHeads (CPU matmul)
          └─ audio_logits (32, 1025)
   │
   ▼
delay_step() + sampling (NumPy) → next_ids (33,)
   │
   ▼ (loop until EOS)
   │
Audio codes → AudioTokenizer (ONNX/TRT/Torch) → waveform
```
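The per-step sampling is pure NumPy. The following is a simplified top-k + top-p sketch in that spirit, not the package's actual `sampling.py` code:

```python
import numpy as np

def sample_token(logits, top_k=50, top_p=0.95, temperature=1.0, rng=None):
    """Sample one id from a (V,) logit vector with top-k then top-p filtering."""
    rng = rng or np.random.default_rng()
    logits = logits / temperature
    # top-k: mask everything below the k-th largest logit (ties survive)
    kth = np.sort(logits)[-top_k]
    logits = np.where(logits < kth, -np.inf, logits)
    # softmax over the surviving logits
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # top-p: keep the smallest prefix of tokens whose cumulative mass >= top_p
    order = np.argsort(probs)[::-1]
    cdf = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cdf, top_p) + 1]
    p = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=p))

logits = np.zeros(1025, dtype=np.float32)
logits[7] = 10.0  # heavily favor id 7
print(sample_token(logits))  # → 7 (the dominant logit survives both filters)
```

In the real pipeline this runs once per audio head per step, producing the `next_ids (33,)` vector shown in the diagram.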
```
moss_tts_delay/llama_cpp/
├── __init__.py            # Package entry, exports LlamaCppPipeline
├── __main__.py            # python -m moss_tts_delay.llama_cpp
├── _constants.py          # Token IDs (from config.json, torch-free)
├── pipeline.py            # LlamaCppPipeline (main entry)
├── backbone.py            # LlamaCppBackbone (C bridge wrapper)
├── backbone_bridge.c      # C bridge source
├── build_bridge.sh        # Build script
├── embedding.py           # EmbeddingLookup (NumPy)
├── lm_heads.py            # NumpyLMHeads + TorchLMHeads
├── delay_state.py         # Delay state machine (NumPy)
├── sampling.py            # top-k/p sampling (NumPy)
├── processor.py           # Tokenizer + prompt builder
├── README.md              # This file
├── README_zh.md           # Chinese documentation
└── conversion/
    ├── extract_weights.py # Weight extraction script
    ├── README.md          # Conversion guide (English)
    └── README_zh.md       # Conversion guide (Chinese)
```