Releases: turboderp-org/exllamav3

0.0.30

19 Apr 17:50

  • Less memory overhead for uncached attn with large head sizes
  • Switchable, uncached sliding-window mode with checkpoints to reduce cache size in Gemma4 and Step3.5
  • More accurate VRAM estimation for autosplit loader
  • Reduced K/V quantization overhead
  • EXL3 GEMM kernel optimizations
  • AVX-512 support (TP all-reduce)
  • Various bugfixes
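As a conceptual illustration of the sliding-window mode mentioned above (this is a sketch of the general technique, not exllamav3's implementation): under a causal sliding window, token i may only attend to tokens j with i - window < j <= i, which bounds the attention span and the K/V state that must be kept.

```python
# Conceptual sketch of a causal sliding-window attention mask.
# Not exllamav3's code; illustrates the masking rule only.
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """True where attention is allowed: token i sees j with i - window < j <= i."""
    return [
        [(i - window < j <= i) for j in range(seq_len)]
        for i in range(seq_len)
    ]

mask = sliding_window_mask(5, 3)
# Row 4 can see positions 2, 3, 4 but not 0 or 1.
```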

Full Changelog: v0.0.29...v0.0.30

0.0.29

12 Apr 00:33

  • Support Gemma4ForConditionalGeneration
  • Fix bug causing quantizer to allocate too much system RAM on resume
  • Fix bug causing potential segfaults when saving large tensors
  • Add loop detection option
  • More benchmarks
  • QoL improvements
  • Other bugfixes
  • Add Torch 2.11 wheels
  • Add Python 3.14 wheels (Torch 2.9+ only)
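The loop detection option above can be pictured as checking whether generation has fallen into a short repeating cycle. A minimal sketch of that idea (the function name, thresholds, and logic here are illustrative assumptions, not exllamav3's actual detector):

```python
# Hypothetical sketch of loop detection over generated token IDs.
def detect_loop(tokens: list[int], max_period: int = 8, repeats: int = 3) -> bool:
    """Return True if the tail of `tokens` repeats some period `repeats` times."""
    for period in range(1, max_period + 1):
        tail = tokens[-period * repeats:]
        if len(tail) < period * repeats:
            continue  # not enough tokens yet to confirm this period
        if all(tail[i] == tail[i % period] for i in range(len(tail))):
            return True
    return False
```

A real detector would typically run incrementally during sampling and stop or re-sample once a loop is confirmed.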

Full Changelog: v0.0.28...v0.0.29

0.0.28

30 Mar 20:05

  • Fix regression breaking inference for GLM4.5-Air and related models

Full Changelog: v0.0.27...v0.0.28

0.0.27

26 Mar 01:44

  • New and more robust allocation strategy for non-integer bitrates
  • Added -hq argument to the quantizer
  • Fix bug causing prompt caching to fail on recurrent models for certain combinations of prompt length and chunk size
  • Fix broken output when using repetition penalties without decay range (affecting some OAI clients via TabbyAPI)
  • Fix issue allowing recurrent state to fall out of sync with K/V cache
  • Support more features in Nanochat, for some reason
  • Other fixes and QoL improvements
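The repetition-penalty fix above concerns penalties applied without a decay range. As a conceptual sketch of the sampling technique involved (not exllamav3's sampler; the decay_range=0 convention here is an assumption): recently generated tokens have their logits penalized, with the penalty fading linearly to nothing over a decay range.

```python
# Conceptual sketch of repetition penalty with linear decay.
# Not exllamav3's implementation; decay_range=0 meaning "no decay" is assumed.
def penalized_logits(logits, recent_ids, penalty=1.1, decay_range=64):
    """Penalize logits of recent tokens; penalty fades to 1.0 over `decay_range`."""
    out = list(logits)
    for age, tok in enumerate(reversed(recent_ids)):
        if decay_range > 0:
            strength = max(0.0, 1.0 - age / decay_range)
        else:
            strength = 1.0  # no decay: full penalty for every past token
        p = 1.0 + (penalty - 1.0) * strength
        # Standard convention: divide positive logits, multiply negative ones.
        out[tok] = out[tok] / p if out[tok] > 0 else out[tok] * p
    return out
```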

Full Changelog: v0.0.26...v0.0.27

0.0.26

16 Mar 18:57

  • Fused expert kernel for improved prompt and batch throughput on MoE models
  • Support OlmoHybridForCausalLM
  • Fix non-integer bitrates when quantizing models with very large MLP layers
  • Minor bugfixes
  • QoL improvements
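Non-integer bitrates, mentioned in the fix above, work by mixing integer bit widths across layers so the weighted average hits the target. A deliberately simplified sketch (real allocators weigh layer sizes and sensitivity; equal-sized layers are assumed here):

```python
# Simplified sketch of non-integer bitrate allocation.
# Assumes equal-sized layers; not exllamav3's allocation strategy.
def mixed_bitrate(num_layers: int, target_bits: float, low: int, high: int):
    """Assign `high` or `low` bits per layer so the mean bitrate ~= target_bits."""
    assert low <= target_bits <= high
    n_high = round(num_layers * (target_bits - low) / (high - low))
    return [high] * n_high + [low] * (num_layers - n_high)
```

For example, a 3.5-bit target over eight equal layers comes out as four 4-bit and four 3-bit layers.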

Full Changelog: v0.0.25...v0.0.26

0.0.25

11 Mar 22:50

  • Add Qwen3_5ForCausalLM and Qwen3_5MoeForCausalLM
  • Support Qwen3.5 finetunes saved entirely in BF16 format
  • Correct tensor format for Qwen3.5 models with split experts (support REAPed models)

Full Changelog: v0.0.24...v0.0.25

0.0.24

08 Mar 19:42

  • Faster MoE routing with graphs
  • Fix regression breaking GLM 4.7
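For context on the routing change above, MoE routing in general picks the top-k experts per token from the router's logits and renormalizes their weights. A conceptual sketch of that step (not the graph-captured kernel itself):

```python
# Conceptual top-k MoE routing: softmax over router logits, keep top-k,
# renormalize the kept weights. Not exllamav3's fused/graphed code path.
import math

def route_tokens(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the top-k experts."""
    probs = [math.exp(x) for x in router_logits]
    total = sum(probs)
    probs = [p / total for p in probs]
    top = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]
```

Capturing this dispatch in a graph avoids relaunching many small kernels per token, which is where the speedup in such approaches typically comes from.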

Full Changelog: v0.0.23...v0.0.24

0.0.23

05 Mar 15:53

  • Support Qwen 3.5 (Qwen3_5ForConditionalGeneration, Qwen3_5MoeForConditionalGeneration)
  • Support Step 3.5 (Step3p5ForCausalLM)
  • Enable tensor-parallel (TP) support for Minimax-M2
  • Switch quantizer to use out_scales by default
  • Include Torch 2.10 wheels
  • Various bugfixes, optimizations and QoL improvements

Full Changelog: v0.0.22...v0.0.23

0.0.22

10 Feb 16:51

  • Fix regression causing models with preserved BF16 tensors (notably multimodal models) to fail quantization

Full Changelog: v0.0.21...v0.0.22

0.0.21

09 Feb 21:21

  • Fix regression affecting Qwen3-Next
  • Avoid using the safetensors library during quantization (fixes occasional OoM errors)

Full Changelog: v0.0.20...v0.0.21