
spec-forge-mini

Minimal Eagle3 draft model trainer for speculative decoding. The repo is designed to be a small, readable, yet performant version of SpecForge by sgl-project. It uses a HuggingFace backend whose prefill performance is comparable to the SGLang prefill kernels; logit extraction is abstracted behind a thin interface, with the goal of being inference-engine agnostic and, eventually, hardware agnostic.
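
For context, speculative decoding has the small draft model propose several tokens that the target model verifies in one forward pass; only the longest agreeing prefix is kept, plus the target's correction token. A minimal sketch of the greedy acceptance rule (toy token IDs, not this repo's implementation):

```python
def accept_prefix(draft_tokens, target_tokens):
    """Greedy speculative-decoding acceptance: keep draft tokens while they
    match what the target model would have produced, then append the
    target's first disagreeing token as the correction."""
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)  # target's correction replaces the bad draft token
            return accepted
    return accepted

# Draft proposes 4 tokens; target agrees on the first 2.
print(accept_prefix([5, 9, 3, 7], [5, 9, 4, 7]))  # [5, 9, 4]
```

The better the draft model, the longer the accepted prefix per step, which is exactly what training on diverse data aims to improve.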

Supported models: Qwen3-8B, Qwen3.5-9B.


Quick Start

```bash
# 1. Prepare data (single dataset)
uv run python scripts/prepare_data.py sharegpt

# 2. Train (single GPU)
uv run torchrun --standalone --nproc_per_node 1 train.py configs/train_qwen3_8b.json
```

Training with Mixed Data

For higher-quality draft models, train on multiple diverse datasets:

```bash
# 1. Download all datasets
uv run python scripts/prepare_data.py --all

# Or download individually:
for ds in sharegpt ultrachat openhermes slimorca wildchat magpie codefeedback mathinstruct wizardlm; do
    uv run python scripts/prepare_data.py $ds
done

# 2. Train on 7 GPUs (keep 1 for evals)
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6 uv run torchrun \
    --standalone --nproc_per_node 7 \
    train.py configs/train_qwen3_5_9b_mixed.json
```

Available datasets: sharegpt, ultrachat, eaglechat, wildchat, lmsys_chat, openhermes, slimorca, magpie, wizardlm, codefeedback, mathinstruct.

Data weights can be configured in the training config (e.g., 1.5x for code/math).
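
As an illustration only (the real schema lives in the JSON files under configs/, and these field names are assumptions, not the actual config keys), a mixed-data section with upweighted code/math datasets might look like:

```json
{
  "datasets": [
    {"name": "sharegpt", "weight": 1.0},
    {"name": "ultrachat", "weight": 1.0},
    {"name": "codefeedback", "weight": 1.5},
    {"name": "mathinstruct", "weight": 1.5}
  ]
}
```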

Serving with SGLang

Requires SGLang with Eagle3 support for the target model. For Qwen3.5, use this fork (PR #20104):

```bash
# Install SGLang (tested with v0.5.6.post2 / flashinfer 0.6.4)
git clone --branch feat/qwen3_5-eagle3 https://github.com/NikitosKh/sglang.git
pip install "./sglang/python[all]"
pip install flashinfer_python==0.6.4 flashinfer_cubin==0.6.4
```

```bash
# Serve Qwen3.5-9B with Eagle3
SGLANG_DISABLE_CUDNN_CHECK=1 python -m sglang.launch_server \
    --model Qwen/Qwen3.5-9B \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path outputs/qwen3_5-9b-eagle3-mixed/final \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 4 \
    --speculative-num-draft-tokens 16 \
    --trust-remote-code \
    --mamba-scheduler-strategy extra_buffer \
    --mem-fraction-static 0.85 \
    --cuda-graph-max-bs 8

# Serve Qwen3-8B (supported in upstream SGLang)
python -m sglang.launch_server \
    --model Qwen/Qwen3-8B \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path outputs/qwen3-8b-eagle3/final
```

Evaluation

Evaluate draft-model quality by measuring throughput speedup over baseline autoregressive decoding. The eval script follows the methodology of the Eagle3 paper (speedup ratio, accept length), using prompts across 5 domains (translation, summarization, QA, code, math/reasoning) plus MT-bench.
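
The two headline numbers reduce to simple arithmetic over raw measurements. A sketch (not this repo's script; variable names are made up):

```python
import statistics

def speedup_ratio(eagle3_tok_per_s, baseline_tok_per_s):
    """Eagle3-paper speedup: speculative throughput over autoregressive throughput."""
    return eagle3_tok_per_s / baseline_tok_per_s

def mean_accept_length(accepted_per_step):
    """Average number of draft tokens the target accepts per verification
    step; higher means the draft model tracks the target better."""
    return statistics.mean(accepted_per_step)

print(speedup_ratio(180.0, 75.0))        # 2.4
print(mean_accept_length([3, 2, 4, 3]))  # 3
```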

Prerequisites

  1. SGLang installed with Eagle3 support (see "Serving with SGLang" above)
  2. A trained checkpoint
  3. A free GPU for evaluation

Running Evaluations

Automated (launches servers, runs benchmarks, produces report):

```bash
uv run python scripts/evaluate_eagle3.py \
    --target-model Qwen/Qwen3.5-9B \
    --draft-model-path outputs/qwen3_5-9b-eagle3-mixed/final \
    --sglang-repo /path/to/sglang \
    --gpu-id 7
```

Manual (if you already have SGLang servers running):

```bash
# Terminal 1: Launch baseline server (no draft model)
CUDA_VISIBLE_DEVICES=7 SGLANG_DISABLE_CUDNN_CHECK=1 \
    PYTHONPATH=/path/to/sglang/python:$PYTHONPATH \
    python -m sglang.launch_server \
    --model Qwen/Qwen3.5-9B --port 30001 --trust-remote-code \
    --mem-fraction-static 0.85 --cuda-graph-max-bs 8

# Terminal 2: Launch Eagle3 server
CUDA_VISIBLE_DEVICES=7 SGLANG_DISABLE_CUDNN_CHECK=1 \
    PYTHONPATH=/path/to/sglang/python:$PYTHONPATH \
    python -m sglang.launch_server \
    --model Qwen/Qwen3.5-9B --port 30000 --trust-remote-code \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path outputs/qwen3_5-9b-eagle3-mixed/final \
    --speculative-num-steps 3 --speculative-eagle-topk 4 \
    --speculative-num-draft-tokens 16 \
    --mamba-scheduler-strategy extra_buffer \
    --mem-fraction-static 0.85 --cuda-graph-max-bs 8

# Terminal 3: Run benchmark
uv run python scripts/evaluate_eagle3.py \
    --benchmark-only \
    --baseline-url http://localhost:30001 \
    --eagle3-url http://localhost:30000 \
    --target-model Qwen/Qwen3.5-9B \
    --output outputs/eval_report.json
```

Eval Options

| Flag | Default | Description |
| --- | --- | --- |
| `--temperatures` | `0.0` | Temperature(s) to test (e.g., `0.0 1.0`) |
| `--max-tokens` | `512` | Max generation tokens per request |
| `--no-mt-bench` | off | Skip MT-bench and use only Spec-Bench-style prompts |
| `--warmup` | `3` | Number of warmup requests before measurement |
| `--output` | `outputs/eval_report.json` | Path for the JSON report |

Metrics

  • Throughput (tok/s): avg, median, p90
  • Speedup ratio: Eagle3 / baseline throughput
  • Time-to-first-token (TTFT)
  • Per-domain breakdown: translation, summarization, QA, code, math/reasoning
  • Accept length: visible in SGLang server logs during evaluation
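
The throughput aggregates above can be sketched in plain Python over per-request samples (illustrative only, not this repo's evaluation code; the p90 here uses a simple nearest-rank-style index):

```python
import statistics

def summarize(samples):
    """Aggregate per-request throughput (tok/s) into avg / median / p90."""
    ordered = sorted(samples)
    p90_index = int(round(0.9 * (len(ordered) - 1)))  # nearest-rank-style p90
    return {
        "avg": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p90": ordered[p90_index],
    }

print(summarize([60, 70, 80, 90, 100]))
```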
