🚀 The feature, motivation and pitch
Hi everyone -- I’d like to propose adding support for Fuzzy Speculative Decoding (FSD), a small extension to speculative decoding that provides an optional, tunable quality–throughput tradeoff by relaxing strict distributional equivalence when accepting draft tokens. We recently published FSD at ACL 2025 and are happy to provide a PR that implements FSD (with all required tests, benchmarks, etc.), but first wanted to check whether such a PR would be in-scope and could potentially be accepted.
What is FSD: Production deployments often need to dynamically control the latency–quality tradeoff of their generations in response to shifting constraints -- for instance, to hit different cost targets across traffic tiers or to adapt to changing hardware / queue conditions. While standard speculative decoding (SD) delivers substantial inference speedups, its gains are effectively binary: since SD accepts only draft tokens in a way that fully preserves the target model's distribution, it offers no mechanism to intentionally trade a small, controlled amount of quality for materially higher throughput. FSD addresses this limitation by introducing a user-controlled threshold T that modulates draft-token acceptance based on how close the target and draft next-token distributions are (i.e., whether the divergence between these distributions is below the threshold). This gives users a simple knob to directly control the draft-token acceptance rate, enabling smooth quality–throughput tuning.
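To make the acceptance criterion concrete, here is a minimal Python sketch of the fuzzy check described above. The function name `fsd_accepts` and the use of KL divergence as the distance metric are illustrative assumptions (the paper allows other divergence metrics); this is not the proposed kernel code.

```python
import numpy as np

def fsd_accepts(target_probs: np.ndarray, draft_probs: np.ndarray, threshold: float) -> bool:
    """Fuzzy acceptance check: accept the draft token when the divergence
    between the target and draft next-token distributions is below T.

    Illustrative sketch using KL(target || draft); other lightweight
    metrics could be substituted.
    """
    eps = 1e-12  # numerical guard against log(0)
    kl = float(np.sum(target_probs * (np.log(target_probs + eps) - np.log(draft_probs + eps))))
    return kl < threshold
```

For identical distributions the KL divergence is 0, so the token is accepted at any positive threshold; as the draft drifts further from the target, larger thresholds are needed to accept.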
Impact on generation quality: Our results show that FSD provides a smooth, tunable quality–throughput curve. At lower threshold settings, FSD can match SD benchmark accuracy while already delivering noticeably higher throughput (commonly ~10–20% faster). As the threshold increases, FSD yields substantially larger speedups (~30–50% over SD) in exchange for small, controlled quality reductions (typically within ~2% relative degradation on standard benchmarks reported in the paper).
Proposed implementation plan: Given its simplicity, FSD can be implemented to:
- Require minimal changes to the current speculative decoding implementation (simply adding fully optional FSD acceptance logic to the SD rejection path in acceptDraftTokensKernel)
- Be gated and opt-in (default disabled), leaving standard speculative decoding completely untouched unless explicitly enabled
- Be cheap when activated, introducing minimal additional computation only when draft tokens are rejected by standard SD, while leaving the dominant acceptance path unchanged
- Apply to any draft-target model pair out-of-the-box (i.e., fully training-free)
Concretely, I’m proposing we implement the “reducible” variant described in the paper: when FSD is enabled, keep the existing SD acceptance unchanged, and only if a draft token is rejected do an additional “fuzzy” check.
- Run normal SD acceptance for candidate token x_i
- If SD accepts → accept and move to the next token (exact current behavior)
- If SD rejects and fsd_enabled:
  - Compute a divergence or distance between targetProbs and draftProbs (e.g., KL divergence, or another lightweight metric)
  - If Div < T → override the rejection and accept
  - Else → keep the rejection and proceed with standard resampling behavior
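The steps above can be sketched end-to-end in Python. Everything here is a hypothetical host-side illustration of the "reducible" variant, not the actual CUDA kernel change: the function and variable names (`accept_draft_token`, `fsd_enabled`, `threshold`) are placeholders, and standard SD acceptance is shown as the usual min(1, p_target/p_draft) coin flip.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """KL(p || q) with a small epsilon to guard against log(0)."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def accept_draft_token(token: int,
                       target_probs: np.ndarray,
                       draft_probs: np.ndarray,
                       threshold: float,
                       fsd_enabled: bool) -> bool:
    """Reducible FSD: run standard SD acceptance first; only on rejection,
    and only if FSD is enabled, apply the fuzzy divergence check."""
    # Standard SD: accept with probability min(1, p_target(x) / p_draft(x))
    ratio = target_probs[token] / max(draft_probs[token], 1e-12)
    if rng.random() < min(1.0, ratio):
        return True  # exact current behavior, dominant path unchanged
    # Fuzzy override: accept anyway if the distributions are close enough
    if fsd_enabled and kl_divergence(target_probs, draft_probs) < threshold:
        return True
    return False  # keep the rejection; caller proceeds to standard resampling
```

Note that with `fsd_enabled=False` the function reduces exactly to standard SD acceptance, and the extra divergence computation only runs on the (already rejected) path, matching the "cheap when activated" property described above.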
Would FSD be considered in-scope for TensorRT-LLM’s speculative decoding support? If so, I'll submit a PR implementing the reducible FSD variant behind an opt-in flag, along with unit tests, accuracy benchmarks, and throughput measurements across representative models.
Thank you!
Alternatives
To our knowledge, there are currently no alternatives to FSD that provide a smooth generation latency–quality tradeoff for arbitrary draft–target model pairs.
Additional context
Our ACL 2025 Findings paper introducing FSD can be found HERE. You can also find our Hugging Face custom_generate implementation of FSD at maxholsman/fuzzy-spec-dec, which lets users run FSD directly in transformers with any draft–target model pair available on the Hugging Face Hub.
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.