
spec-forge-mini

Minimal Eagle3 draft model trainer for speculative decoding. The repo is designed to be a small, readable, yet performant version of SpecForge by sgl-project. It uses a HuggingFace backend whose prefill performance is comparable to the SGLang prefill kernels; logit extraction is abstracted behind a thin interface, with the goal of being inference-engine agnostic and, eventually, hardware agnostic.
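
For context, speculative decoding has the small draft model propose several tokens that the target model verifies in one forward pass; only the longest agreeing prefix is kept, plus the target's correction token. A minimal sketch of the greedy acceptance rule (toy token IDs, not this repo's implementation):

```python
def accept_prefix(draft_tokens, target_tokens):
    """Greedy speculative-decoding acceptance: keep draft tokens while they
    match what the target model would have produced, then append the
    target's first disagreeing token as the correction."""
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)  # target's correction replaces the bad draft token
            return accepted
    return accepted

# Draft proposes 4 tokens; target agrees on the first 2.
print(accept_prefix([5, 9, 3, 7], [5, 9, 4, 7]))  # [5, 9, 4]
```

The better the draft model, the longer the accepted prefix per step, which is exactly what training on diverse data aims to improve.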

Supported models: Qwen3-8B, Qwen3.5-9B.


Quick Start

```bash
# 1. Prepare data (single dataset)
uv run python scripts/prepare_data.py sharegpt

# 2. Train (single GPU)
uv run torchrun --standalone --nproc_per_node 1 train.py configs/train_qwen3_8b.json
```

Training with Mixed Data

For higher-quality draft models, train on multiple diverse datasets:

```bash
# 1. Download all datasets
uv run python scripts/prepare_data.py --all

# Or download individually:
for ds in sharegpt ultrachat openhermes slimorca wildchat magpie codefeedback mathinstruct wizardlm; do
    uv run python scripts/prepare_data.py $ds
done

# 2. Train on 7 GPUs (keep 1 for evals)
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6 uv run torchrun \
    --standalone --nproc_per_node 7 \
    train.py configs/train_qwen3_5_9b_mixed.json
```

Available datasets: sharegpt, ultrachat, eaglechat, wildchat, lmsys_chat, openhermes, slimorca, magpie, wizardlm, codefeedback, mathinstruct.

Data weights can be configured in the training config (e.g., 1.5x for code/math).
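
As an illustration only (the real schema lives in the JSON files under configs/, and these field names are assumptions, not the actual config keys), a mixed-data section with upweighted code/math datasets might look like:

```json
{
  "datasets": [
    {"name": "sharegpt", "weight": 1.0},
    {"name": "ultrachat", "weight": 1.0},
    {"name": "codefeedback", "weight": 1.5},
    {"name": "mathinstruct", "weight": 1.5}
  ]
}
```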

Serving with SGLang

Requires SGLang with Eagle3 support for the target model. For Qwen3.5, use this fork (PR #20104):

```bash
# Install SGLang (tested with v0.5.6.post2 / flashinfer 0.6.4)
git clone --branch feat/qwen3_5-eagle3 https://github.com/NikitosKh/sglang.git
pip install "./sglang/python[all]"
pip install flashinfer_python==0.6.4 flashinfer_cubin==0.6.4
```

```bash
# Serve Qwen3.5-9B with Eagle3
SGLANG_DISABLE_CUDNN_CHECK=1 python -m sglang.launch_server \
    --model Qwen/Qwen3.5-9B \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path outputs/qwen3_5-9b-eagle3-mixed/final \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 4 \
    --speculative-num-draft-tokens 16 \
    --trust-remote-code \
    --mamba-scheduler-strategy extra_buffer \
    --mem-fraction-static 0.85 \
    --cuda-graph-max-bs 8

# Serve Qwen3-8B (supported in upstream SGLang)
python -m sglang.launch_server \
    --model Qwen/Qwen3-8B \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path outputs/qwen3-8b-eagle3/final
```

Evaluation

Evaluate draft-model quality by measuring throughput speedup over baseline autoregressive decoding. The eval script follows the methodology of the Eagle3 paper (speedup ratio, accept length), using prompts across 5 domains (translation, summarization, QA, code, math/reasoning) plus MT-bench.
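
The two headline numbers reduce to simple arithmetic over raw measurements. A sketch (not this repo's script; variable names are made up):

```python
import statistics

def speedup_ratio(eagle3_tok_per_s, baseline_tok_per_s):
    """Eagle3-paper speedup: speculative throughput over autoregressive throughput."""
    return eagle3_tok_per_s / baseline_tok_per_s

def mean_accept_length(accepted_per_step):
    """Average number of draft tokens the target accepts per verification
    step; higher means the draft model tracks the target better."""
    return statistics.mean(accepted_per_step)

print(speedup_ratio(180.0, 75.0))        # 2.4
print(mean_accept_length([3, 2, 4, 3]))  # 3
```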

Prerequisites

  1. SGLang installed with Eagle3 support (see "Serving with SGLang" above)
  2. A trained checkpoint
  3. A free GPU for evaluation

Running Evaluations

Automated (launches servers, runs benchmarks, produces report):

```bash
uv run python scripts/evaluate_eagle3.py \
    --target-model Qwen/Qwen3.5-9B \
    --draft-model-path outputs/qwen3_5-9b-eagle3-mixed/final \
    --sglang-repo /path/to/sglang \
    --gpu-id 7
```

Manual (if you already have SGLang servers running):

```bash
# Terminal 1: Launch baseline server (no draft model)
CUDA_VISIBLE_DEVICES=7 SGLANG_DISABLE_CUDNN_CHECK=1 \
    PYTHONPATH=/path/to/sglang/python:$PYTHONPATH \
    python -m sglang.launch_server \
    --model Qwen/Qwen3.5-9B --port 30001 --trust-remote-code \
    --mem-fraction-static 0.85 --cuda-graph-max-bs 8

# Terminal 2: Launch Eagle3 server
CUDA_VISIBLE_DEVICES=7 SGLANG_DISABLE_CUDNN_CHECK=1 \
    PYTHONPATH=/path/to/sglang/python:$PYTHONPATH \
    python -m sglang.launch_server \
    --model Qwen/Qwen3.5-9B --port 30000 --trust-remote-code \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path outputs/qwen3_5-9b-eagle3-mixed/final \
    --speculative-num-steps 3 --speculative-eagle-topk 4 \
    --speculative-num-draft-tokens 16 \
    --mamba-scheduler-strategy extra_buffer \
    --mem-fraction-static 0.85 --cuda-graph-max-bs 8

# Terminal 3: Run benchmark
uv run python scripts/evaluate_eagle3.py \
    --benchmark-only \
    --baseline-url http://localhost:30001 \
    --eagle3-url http://localhost:30000 \
    --target-model Qwen/Qwen3.5-9B \
    --output outputs/eval_report.json
```

Eval Options

| Flag | Default | Description |
| --- | --- | --- |
| `--temperatures` | `0.0` | Temperature(s) to test (e.g., `0.0 1.0`) |
| `--max-tokens` | `512` | Max generation tokens per request |
| `--no-mt-bench` | off | Skip MT-bench and use only Spec-Bench-style prompts |
| `--warmup` | `3` | Number of warmup requests before measurement |
| `--output` | `outputs/eval_report.json` | Path for the JSON report |

Metrics

  • Throughput (tok/s): avg, median, p90
  • Speedup ratio: Eagle3 / baseline throughput
  • Time-to-first-token (TTFT)
  • Per-domain breakdown: translation, summarization, QA, code, math/reasoning
  • Accept length: visible in SGLang server logs during evaluation
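
The throughput aggregates above can be sketched in plain Python over per-request samples (illustrative only, not this repo's evaluation code; the p90 here uses a simple nearest-rank-style index):

```python
import statistics

def summarize(samples):
    """Aggregate per-request throughput (tok/s) into avg / median / p90."""
    ordered = sorted(samples)
    p90_index = int(round(0.9 * (len(ordered) - 1)))  # nearest-rank-style p90
    return {
        "avg": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p90": ordered[p90_index],
    }

print(summarize([60, 70, 80, 90, 100]))
```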
