🚀 The feature, motivation and pitch
Hi everyone -- I’d like to propose adding support for Fuzzy Speculative Decoding (FSD), a small extension to speculative decoding that provides an optional, tunable quality–throughput tradeoff by relaxing strict distributional equivalence when accepting draft tokens. We recently published FSD at ACL 2025 and are happy to provide a PR that implements FSD (with all required tests, benchmarks, etc.), but first wanted to check whether such a PR would be in-scope and could potentially be accepted.
What is FSD: Production deployments often need to dynamically control the latency–quality tradeoff of their generations in response to shifting constraints -- for instance, to hit different cost targets across traffic tiers or to adapt to changing hardware / queue conditions. While standard speculative decoding (SD) delivers substantial inference speedups, its gains are effectively binary: since SD accepts only draft tokens in a way that fully preserves the target model's distribution, it offers no mechanism to intentionally trade a small, controlled amount of quality for materially higher throughput. FSD addresses this limitation by introducing a user-controlled threshold T that modulates draft-token acceptance based on how close the target and draft next-token distributions are (i.e., whether the divergence between these distributions is below the threshold). This gives users a simple knob to directly control the draft-token acceptance rate, enabling smooth quality–throughput tuning.
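To make the acceptance criterion concrete, here is a minimal Python sketch of the fuzzy check described above. The function name `fsd_accepts` and the use of KL divergence as the distance metric are illustrative assumptions (the paper allows other divergence metrics); this is not the proposed kernel code.

```python
import numpy as np

def fsd_accepts(target_probs: np.ndarray, draft_probs: np.ndarray, threshold: float) -> bool:
    """Fuzzy acceptance check: accept the draft token when the divergence
    between the target and draft next-token distributions is below T.

    Illustrative sketch using KL(target || draft); other lightweight
    metrics could be substituted.
    """
    eps = 1e-12  # numerical guard against log(0)
    kl = float(np.sum(target_probs * (np.log(target_probs + eps) - np.log(draft_probs + eps))))
    return kl < threshold
```

For identical distributions the KL divergence is 0, so the token is accepted at any positive threshold; as the draft drifts further from the target, larger thresholds are needed to accept.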
Impact on generation quality: Our results show that FSD provides a smooth, tunable quality–throughput curve. At lower threshold settings, FSD can match SD benchmark accuracy while already delivering noticeably higher throughput (commonly ~10–20% faster). As the threshold increases, FSD yields substantially larger speedups (~30–50% over SD) in exchange for small, controlled quality reductions (typically within ~2% relative degradation on standard benchmarks reported in the paper).
Proposed implementation plan: Given its simplicity, FSD can be implemented to:
- Require minimal changes to the current speculative decoding implementation (simply adding fully optional FSD acceptance logic to the SD rejection path in acceptDraftTokensKernel)
- Be gated and opt-in (default disabled), leaving standard speculative decoding completely untouched unless explicitly enabled
- Be cheap when activated, introducing minimal additional computation only when draft tokens are rejected by standard SD, while leaving the dominant acceptance path unchanged
- Apply to any draft-target model pair out-of-the-box (i.e., fully training-free)
Concretely, I’m proposing we implement the “reducible” variant described in the paper: when FSD is enabled, keep the existing SD acceptance unchanged, and only if a draft token is rejected do an additional “fuzzy” check.
- Run normal SD acceptance for candidate token x_i
- If SD accepts → accept and move to the next token (exact current behavior)
- If SD rejects and fsd_enabled:
  - Compute a divergence or distance between targetProbs and draftProbs (e.g., KL divergence, or another lightweight metric)
  - If Div < T → override the rejection and accept
  - Else → keep the rejection and proceed with standard resampling behavior
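The steps above can be sketched end-to-end in Python. Everything here is a hypothetical host-side illustration of the "reducible" variant, not the actual CUDA kernel change: the function and variable names (`accept_draft_token`, `fsd_enabled`, `threshold`) are placeholders, and standard SD acceptance is shown as the usual min(1, p_target/p_draft) coin flip.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """KL(p || q) with a small epsilon to guard against log(0)."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def accept_draft_token(token: int,
                       target_probs: np.ndarray,
                       draft_probs: np.ndarray,
                       threshold: float,
                       fsd_enabled: bool) -> bool:
    """Reducible FSD: run standard SD acceptance first; only on rejection,
    and only if FSD is enabled, apply the fuzzy divergence check."""
    # Standard SD: accept with probability min(1, p_target(x) / p_draft(x))
    ratio = target_probs[token] / max(draft_probs[token], 1e-12)
    if rng.random() < min(1.0, ratio):
        return True  # exact current behavior, dominant path unchanged
    # Fuzzy override: accept anyway if the distributions are close enough
    if fsd_enabled and kl_divergence(target_probs, draft_probs) < threshold:
        return True
    return False  # keep the rejection; caller proceeds to standard resampling
```

Note that with `fsd_enabled=False` the function reduces exactly to standard SD acceptance, and the extra divergence computation only runs on the (already rejected) path, matching the "cheap when activated" property described above.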
Would FSD be considered in-scope for TensorRT-LLM’s speculative decoding support? If so, I'll submit a PR implementing the reducible FSD variant behind an opt-in flag, along with unit tests, accuracy benchmarks, and throughput measurements across representative models.
Thank you!
Alternatives
To our knowledge, there are currently no alternatives to FSD that provide a smooth generation latency–quality tradeoff for arbitrary draft–target model pairs.
Additional context
Our ACL 2025 Findings paper introducing FSD can be found HERE. You can also find our Hugging Face custom_generate implementation of FSD at maxholsman/fuzzy-spec-dec, which lets users run FSD directly in transformers with any draft–target model pair available on the Hugging Face Hub.
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.