Minimal Eagle3 draft-model trainer for speculative decoding. The repo is designed as a small, readable, yet performant version of SpecForge by sgl-project. It uses a HuggingFace backend, comparable in performance to SGLang's prefill kernels, which abstracts away logit extraction so the trainer can be inference-engine agnostic and, in the future, hardware agnostic.
Supported models: Qwen3-8B, Qwen3.5-9B.
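The backend abstraction mentioned above can be sketched as one small interface: training only needs hidden states and logits from the target model, so swapping the HuggingFace backend for another engine means implementing that interface. The names below are purely illustrative, not the repo's actual API:

```python
# Hypothetical sketch of an engine-agnostic target-model backend.
# Training consumes hidden states (draft-model inputs) and logits
# (distillation labels); everything else is backend-internal.
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class TargetOutput:
    hidden_states: list  # per-position hidden states fed to the draft model
    logits: list         # per-position target-model logits used as labels


class TargetBackend(ABC):
    """Engine-agnostic source of target-model training signals."""

    @abstractmethod
    def forward(self, input_ids: list[int]) -> TargetOutput: ...


class EchoBackend(TargetBackend):
    """Toy backend for testing the plumbing without loading a real model."""

    def forward(self, input_ids: list[int]) -> TargetOutput:
        n = len(input_ids)
        return TargetOutput(hidden_states=[[0.0]] * n, logits=[[0.0]] * n)
```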
```bash
# 1. Prepare data (single dataset)
uv run python scripts/prepare_data.py sharegpt

# 2. Train (single GPU)
uv run torchrun --standalone --nproc_per_node 1 train.py configs/train_qwen3_8b.json
```

For higher-quality draft models, train on multiple diverse datasets:
```bash
# 1. Download all datasets
uv run python scripts/prepare_data.py --all

# Or download individually:
for ds in sharegpt ultrachat openhermes slimorca wildchat magpie codefeedback mathinstruct wizardlm; do
  uv run python scripts/prepare_data.py $ds
done

# 2. Train on 7 GPUs (keep 1 for evals)
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6 uv run torchrun \
  --standalone --nproc_per_node 7 \
  train.py configs/train_qwen3_5_9b_mixed.json
```

Available datasets: sharegpt, ultrachat, eaglechat, wildchat, lmsys_chat, openhermes, slimorca, magpie, wizardlm, codefeedback, mathinstruct.
Data weights can be configured in the training config (e.g., 1.5x for code/math).
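One way such weights could translate into a training mix is to normalize them and allocate a fixed sample budget proportionally. This is an illustrative sketch, not the repo's actual weighting logic; see the training config for the real semantics:

```python
# Sketch: turn per-dataset weights into per-dataset sample counts by
# normalizing the weights and splitting a fixed sample budget.
def allocate_samples(weights: dict[str, float], budget: int) -> dict[str, int]:
    total = sum(weights.values())
    return {name: round(budget * w / total) for name, w in weights.items()}


# Example mix with code/math upweighted 1.5x, as suggested above.
mix = {"sharegpt": 1.0, "ultrachat": 1.0, "codefeedback": 1.5, "mathinstruct": 1.5}
print(allocate_samples(mix, budget=100_000))
# → {'sharegpt': 20000, 'ultrachat': 20000, 'codefeedback': 30000, 'mathinstruct': 30000}
```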
## Serving with SGLang

Serving requires SGLang with Eagle3 support for the target model. For Qwen3.5, use this fork (PR #20104):
```bash
# Install SGLang (tested with v0.5.6.post2 / flashinfer 0.6.4)
git clone --branch feat/qwen3_5-eagle3 https://github.com/NikitosKh/sglang.git
pip install "./sglang/python[all]"
pip install flashinfer_python==0.6.4 flashinfer_cubin==0.6.4
```
```bash
# Serve Qwen3.5-9B with Eagle3
SGLANG_DISABLE_CUDNN_CHECK=1 python -m sglang.launch_server \
  --model Qwen/Qwen3.5-9B \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path outputs/qwen3_5-9b-eagle3-mixed/final \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 4 \
  --speculative-num-draft-tokens 16 \
  --trust-remote-code \
  --mamba-scheduler-strategy extra_buffer \
  --mem-fraction-static 0.85 \
  --cuda-graph-max-bs 8
```
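Once the server is up, it speaks the OpenAI-compatible chat API on SGLang's default port 30000. A minimal sketch of a client request (the posting helper assumes a running server; `build_request` just constructs the payload):

```python
# Sketch: query a running SGLang server via /v1/chat/completions.
import json
import urllib.request


def build_request(prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat-completion payload."""
    return {
        "model": "Qwen/Qwen3.5-9B",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.0,  # greedy decoding; speculative gains are largest here
    }


def query(payload: dict, url: str = "http://localhost:30000/v1/chat/completions") -> dict:
    """POST the payload to the server and return the parsed JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


if __name__ == "__main__":
    print(build_request("Write a haiku about rivers."))
```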
```bash
# Serve Qwen3-8B (supported in upstream SGLang)
python -m sglang.launch_server \
  --model Qwen/Qwen3-8B \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path outputs/qwen3-8b-eagle3/final
```

Evaluate draft-model quality by measuring throughput speedup over baseline autoregressive decoding. The eval script follows the methodology of the Eagle3 paper (speedup ratio, accept length), using prompts across five domains (translation, summarization, QA, code, math/reasoning) plus MT-bench.

Prerequisites:
- SGLang installed with Eagle3 support (see "Serving with SGLang" above)
- A trained checkpoint
- A free GPU for evaluation
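The two headline metrics are simple to state: the speedup ratio is Eagle3 throughput divided by baseline throughput on the same prompts, and the accept length is the average number of draft tokens committed per verification step. A minimal version:

```python
from statistics import mean


def speedup_ratio(eagle3_tok_s: float, baseline_tok_s: float) -> float:
    """Throughput speedup of Eagle3 over plain autoregressive decoding."""
    return eagle3_tok_s / baseline_tok_s


def mean_accept_length(accepted_per_step: list[int]) -> float:
    """Average number of tokens committed per verification step
    (the accept length SGLang prints in its server logs)."""
    return mean(accepted_per_step)
```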
Automated (launches servers, runs benchmarks, produces report):
```bash
uv run python scripts/evaluate_eagle3.py \
  --target-model Qwen/Qwen3.5-9B \
  --draft-model-path outputs/qwen3_5-9b-eagle3-mixed/final \
  --sglang-repo /path/to/sglang \
  --gpu-id 7
```

Manual (if you already have SGLang servers running):
```bash
# Terminal 1: Launch baseline server (no draft model)
CUDA_VISIBLE_DEVICES=7 SGLANG_DISABLE_CUDNN_CHECK=1 \
  PYTHONPATH=/path/to/sglang/python:$PYTHONPATH \
  python -m sglang.launch_server \
  --model Qwen/Qwen3.5-9B --port 30001 --trust-remote-code \
  --mem-fraction-static 0.85 --cuda-graph-max-bs 8

# Terminal 2: Launch Eagle3 server
CUDA_VISIBLE_DEVICES=7 SGLANG_DISABLE_CUDNN_CHECK=1 \
  PYTHONPATH=/path/to/sglang/python:$PYTHONPATH \
  python -m sglang.launch_server \
  --model Qwen/Qwen3.5-9B --port 30000 --trust-remote-code \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path outputs/qwen3_5-9b-eagle3-mixed/final \
  --speculative-num-steps 3 --speculative-eagle-topk 4 \
  --speculative-num-draft-tokens 16 \
  --mamba-scheduler-strategy extra_buffer \
  --mem-fraction-static 0.85 --cuda-graph-max-bs 8
```
```bash
# Terminal 3: Run benchmark
uv run python scripts/evaluate_eagle3.py \
  --benchmark-only \
  --baseline-url http://localhost:30001 \
  --eagle3-url http://localhost:30000 \
  --target-model Qwen/Qwen3.5-9B \
  --output outputs/eval_report.json
```

| Flag | Default | Description |
|---|---|---|
| `--temperatures` | `0.0` | Temperature(s) to test (e.g., `0.0 1.0`) |
| `--max-tokens` | `512` | Max generation tokens per request |
| `--no-mt-bench` | off | Skip MT-bench and use only Spec-Bench-style prompts |
| `--warmup` | `3` | Number of warmup requests before measurement |
| `--output` | `outputs/eval_report.json` | Path for JSON report |
The report includes:
- Throughput (tok/s): avg, median, p90
- Speedup ratio: Eagle3 / baseline throughput
- Time-to-first-token (TTFT)
- Per-domain breakdown: translation, summarization, QA, code, math/reasoning
- Accept length: visible in SGLang server logs during evaluation
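The throughput statistics (avg, median, p90) can be computed from per-request throughput samples like this, using the nearest-rank convention for the percentile:

```python
from statistics import mean, median


def p90(samples: list[float]) -> float:
    """Nearest-rank 90th percentile: the value at rank ceil(0.9 * n)."""
    ordered = sorted(samples)
    rank = (9 * len(ordered) + 9) // 10  # integer ceiling of 0.9 * n
    return ordered[rank - 1]


def throughput_stats(samples: list[float]) -> dict:
    """Summarize non-empty per-request throughput samples (tok/s)."""
    return {"avg": mean(samples), "median": median(samples), "p90": p90(samples)}
```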