Switch ExLlamaV3 to flashinfer and add MLA/Qwen3.5 support #152
lesj0610 wants to merge 32 commits into turboderp-org:master
Conversation
Review request update: I updated the PR description with backend-swap scope, EXL3 compatibility note, and risk hotspots. Please review with focus on:
- add DeepSeek V2 / GLM-MoE-DSA architecture aliases and an MLA attention module
- add Qwen3.5 dense/MoE architecture paths and parser integration
- extend MoE/GDN projection loading for split tensor layouts
- add smoke/quality test scripts and an MLA support matrix doc
- improve quantization ETA reporting stability in the convert flow
- use the HF chat template for prompt formatting
- propagate recurrent_states across decode/NLL passes
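On the chat-template bullet: the real path goes through the Hugging Face tokenizer's `apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`. As a self-contained illustration of what that rendering produces, here is a simplified ChatML-style stand-in (not the actual Qwen template):

```python
def format_chatml(messages, add_generation_prompt=True):
    """Stand-in for tokenizer.apply_chat_template(..., tokenize=False):
    renders a ChatML-style prompt from an OpenAI-style message list.
    The special tokens below are illustrative, not Qwen3.5's exact template."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # append the assistant header so decoding starts inside the reply
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = format_chatml([{"role": "user", "content": "Hello"}])
# -> "<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n"
```

Using the model's own template instead of hand-built prompt strings is what avoids subtle formatting drift between serving and evaluation.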
Qwen3.5-35B-A3B EXL3 quantization completed, then retested (English + Korean).

What was fixed before retest
Retest results (Qwen3.5-35B-A3B-EXL3-4.00bpw)

Model:
Examples from current outputs show repeated fragments.

Control check

DeepSeek-V2-Lite EXL3 under the same exllamav3 runtime still produces partially coherent outputs, so this is not just a generic test-harness failure.

Interim conclusion

The current Qwen3.5 path is integrated and runnable, but end-to-end output quality is broken in this branch state despite acceptable decode throughput. This points to an architecture/tensor-mapping/runtime correctness issue for Qwen3.5 rather than a pure performance problem. I will continue root-cause narrowing in follow-up commits.
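For root-cause narrowing of this kind, the usual first tool is a logit parity check against a reference runtime (HF or vLLM): compare last-position logits from both runtimes on the same prompt. A minimal sketch, not part of this branch; the function name and top-k size are illustrative:

```python
import numpy as np

def logit_parity(ref_logits, test_logits, top_k=10):
    """Compare last-position logits from a reference runtime (e.g. HF or vLLM)
    against the runtime under test. A large max-abs error or a low top-k
    overlap points at a tensor-mapping/correctness bug rather than sampling
    noise."""
    ref = np.asarray(ref_logits, dtype=np.float64)
    test = np.asarray(test_logits, dtype=np.float64)
    max_abs = float(np.abs(ref - test).max())
    ref_top = set(np.argsort(ref)[-top_k:].tolist())
    test_top = set(np.argsort(test)[-top_k:].tolist())
    return max_abs, len(ref_top & test_top) / top_k
```

Running this layer by layer (hidden states instead of final logits) localizes the first divergent module, which is typically the mis-mapped tensor.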
- avoid the fused-op layout path for split in_proj_qkv/z and in_proj_b/a
- compute mixed_qkv/z/beta/g directly from the split projections
- keep the fused C++ path for fused qkvz/ba projections
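The dispatch above can be sketched as follows; the tensor and output names mirror the bullets, but the function signature and module layout are hypothetical:

```python
import numpy as np

def gdn_in_projections(x, w_qkv=None, w_z=None, w_b=None, w_a=None,
                       fused_op=None):
    """Hedged sketch of the split-tensor path: when the checkpoint stores
    in_proj_qkv / in_proj_z and in_proj_b / in_proj_a separately, compute
    mixed_qkv / z / beta / g with plain matmuls instead of routing through
    the fused-layout op. Checkpoints with fused qkvz / ba tensors keep the
    existing fused (C++) path."""
    if w_qkv is not None:
        # split layout: no fused-layout assumptions, just matmuls
        mixed_qkv = x @ w_qkv
        z = x @ w_z
        beta = x @ w_b
        g = x @ w_a
        return mixed_qkv, z, beta, g
    # fused qkvz/ba layout: defer to the existing fused kernel
    return fused_op(x)
```

The key property is that the split path never has to reinterpret a fused weight layout, which is where split-checkpoint loads previously went wrong.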
Follow-up: the Qwen3.5 gibberish regression is now fixed in this branch.

Root cause (vLLM parity check)

The bug was in

Fix

Commit:
Retest (same quantized model)

Model:
Note: the model often emits English-style "Thinking Process" headings unless the system prompt strongly enforces language style, but the previous random-fragment gibberish regression is resolved.
Added a curl-based gateway regression smoke test (no extra server process) in eval/gateway_regression_smoke.py.

What it checks:
Local run used the existing gateway:

`python3 -u eval/gateway_regression_smoke.py --endpoint http://localhost:8088/v1/chat/completions --api_key shared_key_ONLYONE --model default --timeout_sec 120 --min_gen_tps 20`

Result:
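The flags above map onto a check of roughly this shape: one chat completion against the OpenAI-compatible gateway, then a floor on measured generation throughput. A hedged sketch, not the actual script; the endpoint and key simply mirror the command line:

```python
import json
import time
import urllib.request

def build_request(endpoint, api_key, model="default", max_tokens=64):
    """Build the POST for an OpenAI-compatible chat completions endpoint."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": "Say OK."}],
        "max_tokens": max_tokens,
    }).encode()
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
    )

def check_throughput(completion_tokens, elapsed_sec, min_gen_tps=20.0):
    """Enforce the --min_gen_tps floor on measured generation throughput."""
    tps = completion_tokens / elapsed_sec
    if tps < min_gen_tps:
        raise AssertionError(f"{tps:.1f} tok/s below floor {min_gen_tps}")
    return tps

def smoke_check(endpoint, api_key, min_gen_tps=20.0, timeout_sec=120):
    req = build_request(endpoint, api_key)
    t0 = time.monotonic()
    with urllib.request.urlopen(req, timeout=timeout_sec) as resp:
        out = json.load(resp)
    # completion_tokens from the standard `usage` block of the response
    return check_throughput(out["usage"]["completion_tokens"],
                            time.monotonic() - t0, min_gen_tps)
```

Because it reuses a running gateway, the check exercises the full serving stack (template, scheduler, backend) without spawning a second server process.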
I pulled this down and tried it with Qwen 3.5 35B. I was getting around 28 t/s on a 4090 with the 4bpw quant running on TabbyAPI. I was hoping it would be faster than the GGUF, but the GGUF is still significantly faster at inference for some reason. Either way, good work on this; it seems a lot better than when I tried Qwen 3 Next with exllamav3.
So I'm going over this, and it's a bit of a headache with both the FlashInfer switch and multiple new architectures in the same PR. I've been putting off FlashInfer because there are numerous concerns with it, including Windows compatibility, the need for JIT compilation, and possible performance regressions (a lot of work has gone into minimizing CPU overhead around the flash-attn kernel invocations). There are also some models that just don't work at all with FlashInfer due to head dimensions. Of course, on the other hand, FlashInfer seems to be the only backend that currently supports MLA and attention sinks on consumer GPUs, and both are badly needed (the latter would allow gpt-oss to work). But I would still prefer if this were switchable and not the default for models that don't actually need it, since that's opening a whole can of worms. Just testing that nothing has broken, correctness- or performance-wise, on any of the many supported architectures could take weeks.

At a glance the DeepSeek implementation looks pretty good. A whole new attention module would need a bunch of testing, but nothing stands out as wrong, though I don't really have any experience with MLA myself yet. The routing kernel needs to be updated with top-k groups, but using a Torch fallback until then should be okay.

Qwen3.5 also looks solid. I was working on that myself and didn't get very far, but far enough to determine it should be able to leverage most of the existing Qwen3-Next code, same as in this commit. I was working off the 397B model since the smaller ones weren't out when I started, so testing was painfully slow. This part I could probably merge as is and then optimize later.

I'm going to see if there's a way for me to break this up into smaller chunks to go through it piece by piece:
(Thanks, by the way, this is great work.) (:
I agree that forcing FlashInfer as the only/default backend is too aggressive for upstream right now. A better direction would be to keep FlashAttention and FlashInfer side-by-side, and select the best supported backend at initialization based on:
In practice, that means:
So I do not think upstream should be "FlashInfer-only" today, but I do think it should move toward a capability-based backend abstraction where FlashInfer and FlashAttention can coexist cleanly.
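A capability-based selection of that kind could look like the sketch below. The head-dim support sets, feature flags, and fallback backend name are illustrative placeholders, not the real FlashInfer/flash-attn support matrices:

```python
from dataclasses import dataclass

@dataclass
class ModelNeeds:
    head_dim: int
    needs_mla: bool = False
    needs_attn_sinks: bool = False

# Illustrative support sets -- the real matrices would be probed per backend
# version, with Windows/JIT constraints folded in as well.
FLASH_ATTN_HEAD_DIMS = {64, 96, 128, 192, 256}
FLASHINFER_HEAD_DIMS = {64, 128, 256}

def select_backend(needs: ModelNeeds, flashinfer_available: bool = True) -> str:
    # Only FlashInfer currently covers MLA / attention sinks on consumer GPUs.
    if needs.needs_mla or needs.needs_attn_sinks:
        if flashinfer_available:
            return "flashinfer"
        raise RuntimeError("model requires FlashInfer-only attention features")
    # Otherwise prefer flash-attn: no JIT warm-up, lower CPU overhead.
    if needs.head_dim in FLASH_ATTN_HEAD_DIMS:
        return "flash-attn"
    if flashinfer_available and needs.head_dim in FLASHINFER_HEAD_DIMS:
        return "flashinfer"
    return "torch-sdpa"  # generic fallback; name is illustrative
```

The point of centralizing the decision at initialization is that models which don't need MLA or sinks keep the battle-tested flash-attn path by default, and FlashInfer only becomes the default where it is strictly required.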
This reverts commit 1442baa.
Would this allow EXL3 to be used with the NVIDIA Turing architecture?
Goal
Complete the ExLlamaV3 runtime migration to flashinfer, keep the new backend stable across the existing serving stack, and land the model/architecture work that became necessary during the migration.

Status
This PR now represents a completed flashinfer backend migration.
What remains after this PR is follow-up optimization work, not migration work.
What Changed
Backend migration
- Switched the attention backend to flashinfer.
- Runtime stats can be enabled with EXLLAMAV3_RUNTIME_STATS=1 and EXLLAMAV3_RUNTIME_STATS_INTERVAL=<N>.

Core runtime optimizations (common path)
- Optimized handling of decode metadata tensors (block_index, cache_seqlens, positions).
- Avoided torch.cat(...) in the single-token decode hot path.
- Hot-path cleanups in generator/job.py.

Architecture / model support
- DeepseekV2ForCausalLM / MLA-family aliases
- Qwen3_5ForConditionalGeneration
- Qwen3_5MoeForConditionalGeneration
- DeepseekVLV2ForCausalLM (deepseek-vl2 family)

Multimodal fixes
- Fixes for vision-capable models (Qwen3-VL, Qwen3.5) after the backend transition.

Benchmark tooling
- Updated eval/perf.py to use the current generator/job path instead of the older direct-forward assumptions.

Validation Summary
Functional validation
Confirmed working in local serving / smoke tests:
- EXAONE-4.0.1 (text)
- Qwen3-VL (text + vision)
- Qwen3.5-35B-A3B (text + vision)
- DeepSeek-V2-Lite-Chat BF16 (text)
- deepseek-vl2-tiny BF16 and EXL3 4.0bpw (text + vision, single image, multi-image, grounding tags)
- Qwen3-Next-80B-A3B (text, after generator/parser fixes)
- GLM-4.6V (text + vision; see follow-up notes for remaining decode-path limits)

Flash-attn vs flashinfer comparison (measured)
Representative tested points using the legacy flash-attn tree vs the current flashinfer tree:
Qwen3-Next-80B-A3B-Instruct-exl3-2.0bpw
Current flashinfer tree:
- 256 = 196.19 tok/s, 512 = 338.12 tok/s
- 0 = 49.46 tok/s, 256 = 49.29 tok/s, 512 = 49.16 tok/s

Legacy flash-attn tree:
- 256 = 188.23 tok/s, 512 = 323.89 tok/s
- 0 = 40.32 tok/s, 256 = 43.65 tok/s, 512 = 43.26 tok/s

At these measured points, the current flashinfer path is faster than the legacy flash-attn path.
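The relative change at the 512-context points quoted above can be computed directly (numbers copied from the two series):

```python
def speedup_pct(new_tps, old_tps):
    """Relative throughput change of the flashinfer tree vs the legacy
    flash-attn tree, in percent."""
    return (new_tps / old_tps - 1.0) * 100.0

# 512-context points for Qwen3-Next-80B-A3B-Instruct-exl3-2.0bpw
first_series = speedup_pct(338.12, 323.89)   # ~4.4%
second_series = speedup_pct(49.16, 43.26)    # ~13.6%
```

So the larger win at these points is in the second (per-token) series, which matters most for interactive serving.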
GLM-4.6V-exl3-2.0bpw
Current flashinfer tree:
- 867 / 860 tok/s (256 / 512)
- 36 tok/s

Legacy flash-attn tree:
So for GLM the migration is functionally stable, and the current tree now completes the tested generate path; deeper backend-specific tuning remains a follow-up.
Newly added / newly completed model benchmarks
deepseek-vl2-tiny-EXL3-4.0bpw
- 256 = 4332.33 tok/s, 512 = 8488.48 tok/s, 1024 = 8323.03 tok/s
- 0 = 112.37 tok/s, 256 = 111.55 tok/s, 512 = 111.71 tok/s, 1024 = 111.53 tok/s
- 105-111 tok/s.

Qwen3.5-35B-A3B-EXL3-4.00bpw
- 240.61 tok/s
- 35.31 / 35.12 tok/s

Qwen3-VL-8B-Instruct-EXL3-4.0bpw
- 3491.32 / 3430.09 tok/s
- 57.52 / 57.38 tok/s

Important follow-up (not blocked by this PR)
The remaining major performance ceiling is now backend-specific, not migration-specific.
Two concrete examples:
- EXAONE: native FlashInfer non-tensor-core decode hits `Unsupported group_size: 5`
- GLM-4.6V: native FlashInfer non-tensor-core decode hits `Unsupported group_size: 12`

This means:
That work is intentionally left as a follow-up so this PR stays reviewable and keeps the hot path maintainable.
Reviewer Focus
Please focus review on:
- hot-path modules (attn.py, generator.py, job.py)