
Switch ExLlamaV3 to flashinfer and add MLA/Qwen3.5 support #152

Open
lesj0610 wants to merge 32 commits into turboderp-org:master from lesj0610:feat/flashinfer-backend-migration

Conversation


lesj0610 (Contributor) commented Feb 25, 2026

Goal

Complete the ExLlamaV3 runtime migration to flashinfer, keep the new backend stable across the existing serving stack, and land the model/architecture work that became necessary during the migration.

Status

This PR now represents a completed flashinfer backend migration.

What remains after this PR is follow-up optimization work, not migration work.

What Changed

Backend migration

  • Switched the primary attention runtime path to flashinfer.
  • Kept safe fallbacks where the backend cannot support a given fast path.
  • Added runtime stats toggles for hot-path inspection:
    • EXLLAMAV3_RUNTIME_STATS=1
    • EXLLAMAV3_RUNTIME_STATS_INTERVAL=<N>
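For reference, a minimal sketch of how such toggles can be read at runtime (the helper names are hypothetical; only the environment variable names come from this PR):

```python
import os

def runtime_stats_enabled() -> bool:
    # Hypothetical helper: treat any non-empty, non-"0" value as enabled.
    return os.environ.get("EXLLAMAV3_RUNTIME_STATS", "0") not in ("", "0")

def runtime_stats_interval(default: int = 100) -> int:
    # Hypothetical helper: number of decode steps between stat dumps.
    # Falls back to the default if the value is missing or malformed.
    try:
        return int(os.environ.get("EXLLAMAV3_RUNTIME_STATS_INTERVAL", default))
    except ValueError:
        return default
```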

Core runtime optimizations (common path)

  • Reuse FlashInfer prefill wrappers/plans across layers.
  • Reuse generator batch metadata buffers (block_index, cache_seqlens, positions).
  • Avoid torch.cat(...) in the single-token decode hot path.
  • Reduce non-tensor-core decode replans where the native decode wrapper is actually supported.
  • Guard zero-length recurrent checkpoint handling in generator/job.py.
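The buffer-reuse idea can be illustrated with a small standalone sketch (plain Python, not the actual exllamav3 code): instead of rebuilding a metadata tensor by concatenation on every single-token decode step, the buffer is allocated once and written in place.

```python
class DecodeMetadataBuffer:
    """Illustrative sketch of preallocated-buffer reuse: the torch.cat
    analogue would rebuild the whole buffer each decode step; here we
    allocate once and write at the current length instead."""

    def __init__(self, max_len: int):
        self.positions = [0] * max_len  # allocated once, reused every step
        self.length = 0

    def append_token_position(self, pos: int) -> None:
        # In-place write; no new buffer is allocated on the hot path.
        self.positions[self.length] = pos
        self.length += 1

    def view(self):
        # Analogue of a zero-copy slice over the live region.
        return self.positions[:self.length]
```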

Architecture / model support

  • Added / completed flashinfer-era support for:
    • DeepseekV2ForCausalLM / MLA-family aliases
    • Qwen3_5ForConditionalGeneration
    • Qwen3_5MoeForConditionalGeneration
    • DeepseekVLV2ForCausalLM (deepseek-vl2 family)
  • Added DeepSeek-VL2 EXL3 compile hooks needed to finish quantization.

Multimodal fixes

  • Stabilized the Qwen multimodal path (Qwen3-VL, Qwen3.5) after the backend transition.
  • Fixed DeepSeek-VL2 image preprocessing / normalization so text+vision outputs are semantically correct.

Benchmark tooling

  • Updated eval/perf.py to use the current generator/job path instead of the older direct-forward assumptions.

Validation Summary

Functional validation

Confirmed working in local serving / smoke tests:

  • EXAONE-4.0.1 (text)
  • Qwen3-VL (text + vision)
  • Qwen3.5-35B-A3B (text + vision)
  • DeepSeek-V2-Lite-Chat BF16 (text)
  • deepseek-vl2-tiny BF16 and EXL3 4.0bpw (text + vision, single image, multi-image, grounding tags)
  • Qwen3-Next-80B-A3B (text, after generator/parser fixes)
  • GLM-4.6V (text + vision; see follow-up notes for remaining decode-path limits)

Flash-attn vs flashinfer comparison (measured)

Representative tested points using the legacy flash-attn tree vs the current flashinfer tree:

Qwen3-Next-80B-A3B-Instruct-exl3-2.0bpw

Current flashinfer tree:

  • Prefill: 256 = 196.19 tok/s, 512 = 338.12 tok/s
  • Generate: 0 = 49.46 tok/s, 256 = 49.29 tok/s, 512 = 49.16 tok/s

Legacy flash-attn tree:

  • Prefill: 256 = 188.23 tok/s, 512 = 323.89 tok/s
  • Generate: 0 = 40.32 tok/s, 256 = 43.65 tok/s, 512 = 43.26 tok/s

At these measured points, the current flashinfer path is faster than the legacy flash-attn path.

GLM-4.6V-exl3-2.0bpw

Current flashinfer tree:

  • Short-spot prefill: about 867 / 860 tok/s (256 / 512)
  • Generate: about 36 tok/s

Legacy flash-attn tree:

  • Prefill was comparable at short lengths, but the legacy generate benchmark path crashed later in the run.

So for GLM the migration is functionally stable, and the current tree now completes the tested generate path; deeper backend-specific tuning remains a follow-up.

Newly added / newly completed model benchmarks

deepseek-vl2-tiny-EXL3-4.0bpw

  • Prefill:
    • 256 = 4332.33 tok/s
    • 512 = 8488.48 tok/s
    • 1024 = 8323.03 tok/s
  • Generate:
    • 0 = 112.37 tok/s
    • 256 = 111.55 tok/s
    • 512 = 111.71 tok/s
    • 1024 = 111.53 tok/s
  • Real text/vision chat was also measured at roughly 105-111 tok/s.

Qwen3.5-35B-A3B-EXL3-4.00bpw

  • Prefill (spot check): about 240.61 tok/s
  • Generate (spot checks): about 35.31 / 35.12 tok/s

Qwen3-VL-8B-Instruct-EXL3-4.0bpw

  • Prefill (spot checks): about 3491.32 / 3430.09 tok/s
  • Generate (spot checks): about 57.52 / 57.38 tok/s

Important follow-up (not blocked by this PR)

The remaining major performance ceiling is now backend-specific, not migration-specific.

Two concrete examples:

  • EXAONE: native FlashInfer non-tensor-core decode hits Unsupported group_size: 5
  • GLM-4.6V: native FlashInfer non-tensor-core decode hits Unsupported group_size: 12

This means:

  • the flashinfer migration is complete,
  • but the last major gains for those models now require either:
    • backend-level FlashInfer support expansion, or
    • a separate high-risk decode-strategy project.

That work is intentionally left as a follow-up so this PR stays reviewable and keeps the hot path maintainable.

Reviewer Focus

Please focus review on:

  • flashinfer runtime path correctness and fallback invariants
  • common-path performance changes (attn.py, generator.py, job.py)
  • DeepSeek-VL2 architecture / quantization support
  • Qwen multimodal stability after the backend transition
  • any regressions in long-context behavior for supported fast paths

@lesj0610 (Contributor Author)

Review request update:

I updated the PR description with backend-swap scope, EXL3 compatibility note, and risk hotspots.

Please review with focus on:

  1. attn.py paged-KV metadata + flashinfer wrapper plan/run correctness
  2. Legacy attention mode alias compatibility (flash_attn* accepted)
  3. Long-context/sliding-window behavior after kernel backend change

- add DeepSeek V2/GLM-MoE-DSA architecture aliases and MLA attention module

- add Qwen3.5 dense/moe architecture paths and parser integration

- extend MoE/GDN projection loading for split tensor layouts

- add smoke/quality test scripts and MLA support matrix doc

- improve quantization ETA reporting stability in convert flow
lesj0610 changed the title from "Switch ExLlamaV3 attention backend from flash-attn to flashinfer" to "Switch ExLlamaV3 to flashinfer and add MLA/Qwen3.5 support" on Feb 26, 2026
- use HF chat template for prompt formatting

- propagate recurrent_states across decode/NLL passes
@lesj0610 (Contributor Author)

Qwen3.5-35B-A3B EXL3 quantization completed, then retested (English + Korean).

What was fixed before retest

  • eval/quality_smoke_multilingual.py
  • eval/quality_regression_en_zh.py
  • Applied HF chat template and propagated recurrent states through decode/NLL loops.
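The chat-template change can be sketched as follows (a hedged illustration: format_prompt is a hypothetical wrapper, but apply_chat_template is the standard transformers tokenizer API):

```python
from typing import Optional

def format_prompt(tokenizer, user_message: str, system: Optional[str] = None) -> str:
    # Hypothetical wrapper: defer prompt formatting to the model's own HF
    # chat template rather than hand-rolling prompt strings (the fix applied
    # to the eval scripts). `tokenizer` is assumed to expose the standard
    # transformers `apply_chat_template` method.
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": user_message})
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
```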

Retest results (Qwen3.5-35B-A3B-EXL3-4.00bpw)

Model: /ssd512g/models/Qwen3.5-35B-A3B-EXL3-4.00bpw
GPU split: 23,23

  • decode-only speed (manual greedy path):
    • decode wrapper on: ~33.6-34.1 tok/s
    • decode wrapper off: ~28.9-29.3 tok/s
  • generation quality:
    • English output is consistently gibberish / mixed random fragments
    • Korean output fails language constraint and is also gibberish (missing Hangul in smoke checks)

Examples from current outputs show repeated fragments such as -sup, mixed CJK/Latin tokens, and no coherent answering behavior.

Control check

DeepSeek-V2-Lite EXL3 under the same exllamav3 runtime still produces partially coherent outputs, so this is not just a generic test harness failure.

Interim conclusion

The Qwen3.5 path is integrated and runnable, but end-to-end output quality is broken in this branch state despite acceptable decode throughput. This points to an architecture/tensor-mapping/runtime correctness issue for Qwen3.5 rather than a pure performance problem.

I will continue with root-cause narrowing in follow-up commits.

- avoid fused op layout path for split in_proj_qkv/z and in_proj_b/a

- compute mixed_qkv/z/beta/g directly from split projections

- keep fused C++ path for fused qkvz/ba projections
@lesj0610 (Contributor Author)

Follow-up: Qwen3.5 gibberish regression is now fixed in this branch.

Root cause (vLLM parity check)

The bug was in GatedDeltaNet split-projection handling for Qwen3.5 (in_proj_qkv/in_proj_z/in_proj_b/in_proj_a).
We were sending split outputs through the fused layout helper path, which assumes packed-layout semantics that are not valid for Qwen3.5 split projections.

Fix

Commit: c86c974

  • File: exllamav3/modules/gated_delta_net.py
  • Change:
    • keep fused helper path only for fused projections (qkvz_proj + ba_proj)
    • for split projections, compute mixed_qkv, z, beta, g directly from split tensors
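In sketch form (illustrative only; the attribute and helper names follow the PR description, not necessarily the real exllamav3/modules/gated_delta_net.py):

```python
def gated_delta_net_inputs(x, module):
    """Illustrative shape of the fix: the fused layout helper is only valid
    for packed qkvz/ba projections, so split checkpoints (Qwen3.5) compute
    each stream from its own projection instead of going through the fused
    helper. Attribute names here are assumptions from the PR text."""
    if module.qkvz_proj is not None:
        # Fused checkpoint layout: keep the fused helper path.
        mixed_qkv, z = module.fused_qkvz_helper(module.qkvz_proj(x))
        beta, g = module.fused_ba_helper(module.ba_proj(x))
    else:
        # Split checkpoint layout: no packed-layout assumptions apply,
        # so each output comes directly from its own projection.
        mixed_qkv = module.in_proj_qkv(x)
        z = module.in_proj_z(x)
        beta = module.in_proj_b(x)
        g = module.in_proj_a(x)
    return mixed_qkv, z, beta, g
```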

Retest (same quantized model)

Model: /ssd512g/models/Qwen3.5-35B-A3B-EXL3-4.00bpw

  • eval/quality_smoke_multilingual.py:
    • english_reasoning: PASS
    • korean_reasoning: PASS
    • korean_math: PASS
  • 10-prompt EN/KO regression smoke:
    • avg decode speed: ~31.2 tok/s
    • gibberish markers (-sup/viste spam): 0/10

Note: model often emits English-style "Thinking Process" headings unless system prompting strongly enforces language style, but the previous random-fragment gibberish regression is resolved.

@lesj0610 (Contributor Author)

Added curl-based gateway regression smoke (no extra server process) in eval/gateway_regression_smoke.py.

What it checks:

  • OpenAI-compatible SSE stream parsing from curl -sN
  • EN/KO language sanity
  • gibberish marker regression
  • timeout handling
  • optional minimum generation TPS gate
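The SSE-parsing part can be illustrated with a minimal standalone parser (a hypothetical helper; the real script may differ in details):

```python
import json

def parse_sse_chunks(stream_text: str) -> str:
    """Illustrative parser for what the smoke script checks: extract delta
    content from an OpenAI-compatible SSE stream (as emitted by curl -sN),
    stopping at the [DONE] sentinel."""
    pieces = []
    for line in stream_text.splitlines():
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines and event metadata
        payload = line[len("data: "):].strip()
        if payload == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0].get("delta", {})
        pieces.append(delta.get("content", ""))
    return "".join(pieces)
```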

Local run used existing gateway:

python3 -u eval/gateway_regression_smoke.py \
  --endpoint http://localhost:8088/v1/chat/completions \
  --api_key shared_key_ONLYONE \
  --model default \
  --timeout_sec 120 \
  --min_gen_tps 20

Result:

  • en_basic PASS, gen_tps ~24.62
  • ko_basic PASS, gen_tps ~26.70
  • ko_math PASS, gen_tps ~26.57
  • Final: All gateway regression smoke checks passed.


rcouture27 commented Feb 27, 2026

I pulled this down and tried it with Qwen 3.5 35B. I was getting around 28 t/s on a 4090 with a 4bpw quant, running it on TabbyAPI. I was hoping it would be faster than the GGUF, but the GGUF still runs inference significantly faster for some reason. Either way, good work on this; it seems a lot better than when I tried Qwen3-Next with exllamav3.

rcouture27 mentioned this pull request on Mar 1, 2026
@turboderp (Member)

So I'm going over this, and it's a bit of a headache with both the FlashInfer switch and multiple new architectures in the same PR.

I've been putting off FlashInfer because there are numerous concerns with it, including Windows compatibility, the need for JIT compilation, and possible performance regressions (a lot of work has gone into minimizing CPU overhead around the flash-attn kernel invocations). There are also some models that just don't work at all with FlashInfer due to head dimensions.

Of course, on the other hand, FlashInfer seems to be the only backend that currently supports MLA and attention sinks on consumer GPUs, and both are badly needed (the latter would allow gpt-oss to work). But I would still prefer if this was switchable and not the default for models that don't actually need it, since that's opening a whole can of worms. Just testing that nothing has broken correctness or performance-wise on any of the many supported architectures could take weeks.

At a glance the DeepSeek implementation looks pretty good. Whole new attention module would need a bunch of testing, but there's nothing that stands out as wrong, though I don't really have any experience with MLA myself yet. Routing kernel needs to be updated with topk groups, but using a Torch fallback until then should be okay.

Qwen3.5 also looks solid. I was working on that myself and didn't get very far, but far enough to determine it should be able to leverage most of the existing Qwen3-Next code, same as in this commit. I was working off the 397B model since the smaller ones weren't out when I started, so testing was painfully slow. This part I could probably merge as is and then optimize later.

I'm going to see if there's a way for me to break this up into smaller chunks to go through it piece by piece:

  • FlashInfer
  • DeepSeek support
  • Qwen3.5 support
  • Regression tests

@turboderp (Member)

(Thanks, by the way, this is great work.) (:


lesj0610 commented Mar 2, 2026

I agree that forcing FlashInfer as the only/default backend is too aggressive for upstream right now.

A better direction would be to keep FlashAttention and FlashInfer side-by-side, and select the best supported backend at initialization based on:

  • which dependencies are actually installed,
  • the model’s capabilities and constraints,
  • and the stability/performance characteristics of each path.

In practice, that means:

  • only expose backend choices that are actually available in the environment,
  • keep backend selection fixed at model/layer init time (not per-token),
  • use FlashInfer where it is required or clearly beneficial (for example MLA-capable models),
  • keep the existing stable paths where FlashInfer still has coverage gaps,
  • and retain SDPA as the universal fallback.
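A minimal sketch of that init-time selection (illustrative names and capability keys, not a proposed exllamav3 API):

```python
def select_attention_backend(model_caps: dict, available: set) -> str:
    """Illustrative init-time backend selector following the criteria above.
    `model_caps` keys and backend names are assumptions for the sketch."""
    # FlashInfer where it is required or clearly beneficial (e.g. MLA).
    if model_caps.get("needs_mla") and "flashinfer" in available:
        return "flashinfer"
    # Otherwise prefer the existing stable flash-attn path, when it is
    # installed and the model's head dimensions are supported by it.
    if "flash_attn" in available and model_caps.get("head_dim_supported", True):
        return "flash_attn"
    # SDPA remains the universal fallback.
    return "sdpa"
```

Selection like this would run once at model/layer init, never per token, and only backends actually importable in the environment would appear in the available set.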

So I do not think upstream should be “FlashInfer-only” today, but I do think it should move toward a capability-based backend abstraction where FlashInfer and FlashAttention can coexist cleanly.


doublex commented Mar 7, 2026

Would this allow Exl3 to be used with the Nvidia Turing architecture?

4 participants