Switch ExLlamaV3 to flashinfer and add MLA/Qwen3.5 support #152
lesj0610 wants to merge 32 commits into turboderp-org:master
Conversation
Review request update: I updated the PR description with backend-swap scope, EXL3 compatibility note, and risk hotspots. Please review with focus on:
- add DeepSeek V2 / GLM-MoE-DSA architecture aliases and an MLA attention module
- add Qwen3.5 dense/MoE architecture paths and parser integration
- extend MoE/GDN projection loading for split tensor layouts
- add smoke/quality test scripts and an MLA support matrix doc
- improve quantization ETA reporting stability in the convert flow
- use the HF chat template for prompt formatting
- propagate recurrent_states across decode/NLL passes
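On the chat-template bullet: the real path goes through the Hugging Face tokenizer's `apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`. As a self-contained illustration of what that rendering produces, here is a simplified ChatML-style stand-in (not the actual Qwen template):

```python
def format_chatml(messages, add_generation_prompt=True):
    """Stand-in for tokenizer.apply_chat_template(..., tokenize=False):
    renders a ChatML-style prompt from an OpenAI-style message list.
    The special tokens below are illustrative, not Qwen3.5's exact template."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # append the assistant header so decoding starts inside the reply
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = format_chatml([{"role": "user", "content": "Hello"}])
# -> "<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n"
```

Using the model's own template instead of hand-built prompt strings is what avoids subtle formatting drift between serving and evaluation.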
Qwen3.5-35B-A3B EXL3 quantization completed, then retested (English + Korean).

What was fixed before retest
Retest results (Qwen3.5-35B-A3B-EXL3-4.00bpw)

Model:
Examples from current outputs show repeated fragments.

Control check

DeepSeek-V2-Lite EXL3 under the same exllamav3 runtime still produces partially coherent outputs, so this is not just a generic test-harness failure.

Interim conclusion

The current Qwen3.5 path is integrated and runnable, but end-to-end output quality is broken in this branch state despite acceptable decode throughput. This points to an architecture/tensor-mapping/runtime correctness issue for Qwen3.5 rather than a pure performance problem. I will continue root-cause narrowing in follow-up commits.
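For root-cause narrowing of this kind, the usual first tool is a logit parity check against a reference runtime (HF or vLLM): compare last-position logits from both runtimes on the same prompt. A minimal sketch, not part of this branch; the function name and top-k size are illustrative:

```python
import numpy as np

def logit_parity(ref_logits, test_logits, top_k=10):
    """Compare last-position logits from a reference runtime (e.g. HF or vLLM)
    against the runtime under test. A large max-abs error or a low top-k
    overlap points at a tensor-mapping/correctness bug rather than sampling
    noise."""
    ref = np.asarray(ref_logits, dtype=np.float64)
    test = np.asarray(test_logits, dtype=np.float64)
    max_abs = float(np.abs(ref - test).max())
    ref_top = set(np.argsort(ref)[-top_k:].tolist())
    test_top = set(np.argsort(test)[-top_k:].tolist())
    return max_abs, len(ref_top & test_top) / top_k
```

Running this layer by layer (hidden states instead of final logits) localizes the first divergent module, which is typically the mis-mapped tensor.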
- avoid the fused-op layout path for split in_proj_qkv/z and in_proj_b/a
- compute mixed_qkv/z/beta/g directly from the split projections
- keep the fused C++ path for fused qkvz/ba projections
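The dispatch above can be sketched as follows; the tensor and output names mirror the bullets, but the function signature and module layout are hypothetical:

```python
import numpy as np

def gdn_in_projections(x, w_qkv=None, w_z=None, w_b=None, w_a=None,
                       fused_op=None):
    """Hedged sketch of the split-tensor path: when the checkpoint stores
    in_proj_qkv / in_proj_z and in_proj_b / in_proj_a separately, compute
    mixed_qkv / z / beta / g with plain matmuls instead of routing through
    the fused-layout op. Checkpoints with fused qkvz / ba tensors keep the
    existing fused (C++) path."""
    if w_qkv is not None:
        # split layout: no fused-layout assumptions, just matmuls
        mixed_qkv = x @ w_qkv
        z = x @ w_z
        beta = x @ w_b
        g = x @ w_a
        return mixed_qkv, z, beta, g
    # fused qkvz/ba layout: defer to the existing fused kernel
    return fused_op(x)
```

The key property is that the split path never has to reinterpret a fused weight layout, which is where split-checkpoint loads previously went wrong.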
Follow-up: the Qwen3.5 gibberish regression is now fixed in this branch.

Root cause (vLLM parity check)

The bug was in

Fix

Commit:
Retest (same quantized model)

Model:
Note: the model often emits English-style "Thinking Process" headings unless the system prompt strongly enforces language style, but the previous random-fragment gibberish regression is resolved.
Added a curl-based gateway regression smoke test (no extra server process) in eval/gateway_regression_smoke.py.

What it checks:
Local run used the existing gateway:

`python3 -u eval/gateway_regression_smoke.py --endpoint http://localhost:8088/v1/chat/completions --api_key shared_key_ONLYONE --model default --timeout_sec 120 --min_gen_tps 20`

Result:
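The flags above map onto a check of roughly this shape: one chat completion against the OpenAI-compatible gateway, then a floor on measured generation throughput. A hedged sketch, not the actual script; the endpoint and key simply mirror the command line:

```python
import json
import time
import urllib.request

def build_request(endpoint, api_key, model="default", max_tokens=64):
    """Build the POST for an OpenAI-compatible chat completions endpoint."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": "Say OK."}],
        "max_tokens": max_tokens,
    }).encode()
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
    )

def check_throughput(completion_tokens, elapsed_sec, min_gen_tps=20.0):
    """Enforce the --min_gen_tps floor on measured generation throughput."""
    tps = completion_tokens / elapsed_sec
    if tps < min_gen_tps:
        raise AssertionError(f"{tps:.1f} tok/s below floor {min_gen_tps}")
    return tps

def smoke_check(endpoint, api_key, min_gen_tps=20.0, timeout_sec=120):
    req = build_request(endpoint, api_key)
    t0 = time.monotonic()
    with urllib.request.urlopen(req, timeout=timeout_sec) as resp:
        out = json.load(resp)
    # completion_tokens from the standard `usage` block of the response
    return check_throughput(out["usage"]["completion_tokens"],
                            time.monotonic() - t0, min_gen_tps)
```

Because it reuses a running gateway, the check exercises the full serving stack (template, scheduler, backend) without spawning a second server process.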
I pulled this down and tried it with Qwen 3.5 35B. I was getting around 28 t/s on a 4090 with the 4bpw quant running on TabbyAPI. I was hoping it would be faster than the GGUF, but the GGUF is still significantly faster at inference for some reason. Either way, good work on this; it seems a lot better than when I tried Qwen 3 Next with exllamav3.
So I'm going over this, and it's a bit of a headache with both the FlashInfer switch and multiple new architectures in the same PR. I've been putting off FlashInfer because there are numerous concerns with it, including Windows compatibility, the need for JIT compilation, and possible performance regressions (a lot of work has gone into minimizing CPU overhead around the flash-attn kernel invocations). There are also some models that just don't work at all with FlashInfer due to head dimensions. Of course, on the other hand, FlashInfer seems to be the only backend that currently supports MLA and attention sinks on consumer GPUs, and both are badly needed (the latter would allow gpt-oss to work). But I would still prefer if this were switchable and not the default for models that don't actually need it, since that's opening a whole can of worms. Just testing that nothing has broken, correctness- or performance-wise, on any of the many supported architectures could take weeks.

At a glance the DeepSeek implementation looks pretty good. A whole new attention module would need a bunch of testing, but nothing stands out as wrong, though I don't really have any experience with MLA myself yet. The routing kernel needs to be updated with top-k groups, but using a Torch fallback until then should be okay.

Qwen3.5 also looks solid. I was working on that myself and didn't get very far, but far enough to determine it should be able to leverage most of the existing Qwen3-Next code, same as in this commit. I was working off the 397B model since the smaller ones weren't out when I started, so testing was painfully slow. This part I could probably merge as is and then optimize later.

I'm going to see if there's a way for me to break this up into smaller chunks to go through it piece by piece:
(Thanks, by the way, this is great work.) (:
I agree that forcing FlashInfer as the only/default backend is too aggressive for upstream right now. A better direction would be to keep FlashAttention and FlashInfer side-by-side, and select the best supported backend at initialization based on:
In practice, that means:
So I do not think upstream should be "FlashInfer-only" today, but I do think it should move toward a capability-based backend abstraction where FlashInfer and FlashAttention can coexist cleanly.
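A capability-based selection of that kind could look like the sketch below. The head-dim support sets, feature flags, and fallback backend name are illustrative placeholders, not the real FlashInfer/flash-attn support matrices:

```python
from dataclasses import dataclass

@dataclass
class ModelNeeds:
    head_dim: int
    needs_mla: bool = False
    needs_attn_sinks: bool = False

# Illustrative support sets -- the real matrices would be probed per backend
# version, with Windows/JIT constraints folded in as well.
FLASH_ATTN_HEAD_DIMS = {64, 96, 128, 192, 256}
FLASHINFER_HEAD_DIMS = {64, 128, 256}

def select_backend(needs: ModelNeeds, flashinfer_available: bool = True) -> str:
    # Only FlashInfer currently covers MLA / attention sinks on consumer GPUs.
    if needs.needs_mla or needs.needs_attn_sinks:
        if flashinfer_available:
            return "flashinfer"
        raise RuntimeError("model requires FlashInfer-only attention features")
    # Otherwise prefer flash-attn: no JIT warm-up, lower CPU overhead.
    if needs.head_dim in FLASH_ATTN_HEAD_DIMS:
        return "flash-attn"
    if flashinfer_available and needs.head_dim in FLASHINFER_HEAD_DIMS:
        return "flashinfer"
    return "torch-sdpa"  # generic fallback; name is illustrative
```

The point of centralizing the decision at initialization is that models which don't need MLA or sinks keep the battle-tested flash-attn path by default, and FlashInfer only becomes the default where it is strictly required.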
This reverts commit 1442baa.
Would this allow EXL3 to be used with the NVIDIA Turing architecture?
Goal
Complete the ExLlamaV3 runtime migration to flashinfer, keep the new backend stable across the existing serving stack, and land the model/architecture work that became necessary during the migration.

Status
This PR now represents a completed flashinfer backend migration.
What remains after this PR is follow-up optimization work, not migration work.
What Changed
Backend migration
- Switched the attention backend to flashinfer.
- Runtime stats can be enabled with EXLLAMAV3_RUNTIME_STATS=1 and EXLLAMAV3_RUNTIME_STATS_INTERVAL=<N>.

Core runtime optimizations (common path)
- Optimized handling of decode metadata tensors (block_index, cache_seqlens, positions).
- Avoided torch.cat(...) in the single-token decode hot path.
- Hot-path cleanups in generator/job.py.

Architecture / model support
- DeepseekV2ForCausalLM / MLA-family aliases
- Qwen3_5ForConditionalGeneration
- Qwen3_5MoeForConditionalGeneration
- DeepseekVLV2ForCausalLM (deepseek-vl2 family)

Multimodal fixes
- Fixes for vision-capable models (Qwen3-VL, Qwen3.5) after the backend transition.

Benchmark tooling
- Updated eval/perf.py to use the current generator/job path instead of the older direct-forward assumptions.

Validation Summary
Functional validation
Confirmed working in local serving / smoke tests:
- EXAONE-4.0.1 (text)
- Qwen3-VL (text + vision)
- Qwen3.5-35B-A3B (text + vision)
- DeepSeek-V2-Lite-Chat BF16 (text)
- deepseek-vl2-tiny BF16 and EXL3 4.0bpw (text + vision, single image, multi-image, grounding tags)
- Qwen3-Next-80B-A3B (text, after generator/parser fixes)
- GLM-4.6V (text + vision; see follow-up notes for remaining decode-path limits)

Flash-attn vs flashinfer comparison (measured)
Representative tested points using the legacy flash-attn tree vs the current flashinfer tree:
Qwen3-Next-80B-A3B-Instruct-exl3-2.0bpw
Current flashinfer tree:
- 256 = 196.19 tok/s, 512 = 338.12 tok/s
- 0 = 49.46 tok/s, 256 = 49.29 tok/s, 512 = 49.16 tok/s

Legacy flash-attn tree:
- 256 = 188.23 tok/s, 512 = 323.89 tok/s
- 0 = 40.32 tok/s, 256 = 43.65 tok/s, 512 = 43.26 tok/s

At these measured points, the current flashinfer path is faster than the legacy flash-attn path.
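The relative change at the 512-context points quoted above can be computed directly (numbers copied from the two series):

```python
def speedup_pct(new_tps, old_tps):
    """Relative throughput change of the flashinfer tree vs the legacy
    flash-attn tree, in percent."""
    return (new_tps / old_tps - 1.0) * 100.0

# 512-context points for Qwen3-Next-80B-A3B-Instruct-exl3-2.0bpw
first_series = speedup_pct(338.12, 323.89)   # ~4.4%
second_series = speedup_pct(49.16, 43.26)    # ~13.6%
```

So the larger win at these points is in the second (per-token) series, which matters most for interactive serving.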
GLM-4.6V-exl3-2.0bpw
Current flashinfer tree:
- 867 / 860 tok/s (256 / 512)
- 36 tok/s

Legacy flash-attn tree:
So for GLM the migration is functionally stable, and the current tree now completes the tested generate path; deeper backend-specific tuning remains a follow-up.
Newly added / newly completed model benchmarks
deepseek-vl2-tiny-EXL3-4.0bpw
- 256 = 4332.33 tok/s, 512 = 8488.48 tok/s, 1024 = 8323.03 tok/s
- 0 = 112.37 tok/s, 256 = 111.55 tok/s, 512 = 111.71 tok/s, 1024 = 111.53 tok/s
- 105-111 tok/s.

Qwen3.5-35B-A3B-EXL3-4.00bpw
- 240.61 tok/s
- 35.31 / 35.12 tok/s

Qwen3-VL-8B-Instruct-EXL3-4.0bpw
- 3491.32 / 3430.09 tok/s
- 57.52 / 57.38 tok/s

Important follow-up (not blocked by this PR)
The remaining major performance ceiling is now backend-specific, not migration-specific.
Two concrete examples:
- EXAONE: native FlashInfer non-tensor-core decode hits `Unsupported group_size: 5`
- GLM-4.6V: native FlashInfer non-tensor-core decode hits `Unsupported group_size: 12`

This means:
That work is intentionally left as a follow-up so this PR stays reviewable and keeps the hot path maintainable.
Reviewer Focus
Please focus review on:
- hot-path modules (attn.py, generator.py, job.py)