Releases: turboderp-org/exllamav3

0.0.30

19 Apr 17:50

  • Less memory overhead for uncached attn with large head sizes
  • Switchable, uncached sliding-window mode with checkpoints to reduce cache size in Gemma4 and Step3.5
  • More accurate VRAM estimation for autosplit loader
  • Reduced K/V quantization overhead
  • EXL3 GEMM kernel optimizations
  • AVX-512 support (TP all-reduce)
  • Various bugfixes
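As a conceptual illustration of the sliding-window mode mentioned above (this is a sketch of the general technique, not exllamav3's implementation): under a causal sliding window, token i may only attend to tokens j with i - window < j <= i, which bounds the attention span and the K/V state that must be kept.

```python
# Conceptual sketch of a causal sliding-window attention mask.
# Not exllamav3's code; illustrates the masking rule only.
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """True where attention is allowed: token i sees j with i - window < j <= i."""
    return [
        [(i - window < j <= i) for j in range(seq_len)]
        for i in range(seq_len)
    ]

mask = sliding_window_mask(5, 3)
# Row 4 can see positions 2, 3, 4 but not 0 or 1.
```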

Full Changelog: v0.0.29...v0.0.30

0.0.29

12 Apr 00:33

  • Support Gemma4ForConditionalGeneration
  • Fix bug causing quantizer to allocate too much system RAM on resume
  • Fix bug causing potential segfaults when saving large tensors
  • Add loop detection option
  • More benchmarks
  • QoL improvements
  • Other bugfixes
  • Add Torch 2.11 wheels
  • Add Python 3.14 wheels (Torch 2.9+ only)
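The loop detection option above can be pictured as checking whether generation has fallen into a short repeating cycle. A minimal sketch of that idea (the function name, thresholds, and logic here are illustrative assumptions, not exllamav3's actual detector):

```python
# Hypothetical sketch of loop detection over generated token IDs.
def detect_loop(tokens: list[int], max_period: int = 8, repeats: int = 3) -> bool:
    """Return True if the tail of `tokens` repeats some period `repeats` times."""
    for period in range(1, max_period + 1):
        tail = tokens[-period * repeats:]
        if len(tail) < period * repeats:
            continue  # not enough tokens yet to confirm this period
        if all(tail[i] == tail[i % period] for i in range(len(tail))):
            return True
    return False
```

A real detector would typically run incrementally during sampling and stop or re-sample once a loop is confirmed.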

Full Changelog: v0.0.28...v0.0.29

0.0.28

30 Mar 20:05

  • Fix regression breaking inference for GLM4.5-Air and related models

Full Changelog: v0.0.27...v0.0.28

0.0.27

26 Mar 01:44

  • New and more robust allocation strategy for non-integer bitrates
  • Added -hq argument to the quantizer
  • Fix bug causing prompt caching to fail on recurrent models for certain combinations of prompt length and chunk size
  • Fix broken output when using repetition penalties without decay range (affecting some OAI clients via TabbyAPI)
  • Fix issue allowing recurrent state to fall out of sync with K/V cache
  • Support more features in Nanochat, for some reason
  • Other fixes and QoL improvements
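The repetition-penalty fix above concerns penalties applied without a decay range. As a conceptual sketch of the sampling technique involved (not exllamav3's sampler; the decay_range=0 convention here is an assumption): recently generated tokens have their logits penalized, with the penalty fading linearly to nothing over a decay range.

```python
# Conceptual sketch of repetition penalty with linear decay.
# Not exllamav3's implementation; decay_range=0 meaning "no decay" is assumed.
def penalized_logits(logits, recent_ids, penalty=1.1, decay_range=64):
    """Penalize logits of recent tokens; penalty fades to 1.0 over `decay_range`."""
    out = list(logits)
    for age, tok in enumerate(reversed(recent_ids)):
        if decay_range > 0:
            strength = max(0.0, 1.0 - age / decay_range)
        else:
            strength = 1.0  # no decay: full penalty for every past token
        p = 1.0 + (penalty - 1.0) * strength
        # Standard convention: divide positive logits, multiply negative ones.
        out[tok] = out[tok] / p if out[tok] > 0 else out[tok] * p
    return out
```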

Full Changelog: v0.0.26...v0.0.27

0.0.26

16 Mar 18:57

  • Fused expert kernel for improved prompt and batch throughput on MoE models
  • Support OlmoHybridForCausalLM
  • Fix non-integer bitrates when quantizing models with very large MLP layers
  • Minor bugfixes
  • QoL improvements
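Non-integer bitrates, mentioned in the fix above, work by mixing integer bit widths across layers so the weighted average hits the target. A deliberately simplified sketch (real allocators weigh layer sizes and sensitivity; equal-sized layers are assumed here):

```python
# Simplified sketch of non-integer bitrate allocation.
# Assumes equal-sized layers; not exllamav3's allocation strategy.
def mixed_bitrate(num_layers: int, target_bits: float, low: int, high: int):
    """Assign `high` or `low` bits per layer so the mean bitrate ~= target_bits."""
    assert low <= target_bits <= high
    n_high = round(num_layers * (target_bits - low) / (high - low))
    return [high] * n_high + [low] * (num_layers - n_high)
```

For example, a 3.5-bit target over eight equal layers comes out as four 4-bit and four 3-bit layers.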

Full Changelog: v0.0.25...v0.0.26

0.0.25

11 Mar 22:50

  • Add Qwen3_5ForCausalLM and Qwen3_5MoeForCausalLM
  • Support Qwen3.5 finetunes saved entirely in BF16 format
  • Correct tensor format for Qwen3.5 models with split experts (support REAPed models)

Full Changelog: v0.0.24...v0.0.25

0.0.24

08 Mar 19:42

  • Faster MoE routing with graphs
  • Fix regression breaking GLM 4.7
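For context on the routing change above, MoE routing in general picks the top-k experts per token from the router's logits and renormalizes their weights. A conceptual sketch of that step (not the graph-captured kernel itself):

```python
# Conceptual top-k MoE routing: softmax over router logits, keep top-k,
# renormalize the kept weights. Not exllamav3's fused/graphed code path.
import math

def route_tokens(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the top-k experts."""
    probs = [math.exp(x) for x in router_logits]
    total = sum(probs)
    probs = [p / total for p in probs]
    top = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]
```

Capturing this dispatch in a graph avoids relaunching many small kernels per token, which is where the speedup in such approaches typically comes from.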

Full Changelog: v0.0.23...v0.0.24

0.0.23

05 Mar 15:53

  • Support Qwen 3.5 (Qwen3_5ForConditionalGeneration, Qwen3_5MoeForConditionalGeneration)
  • Support Step 3.5 (Step3p5ForCausalLM)
  • Enable tensor-parallel (TP) support for Minimax-M2
  • Switch quantizer to use out_scales by default
  • Include Torch 2.10 wheels
  • Various bugfixes, optimizations and QoL improvements

Full Changelog: v0.0.22...v0.0.23

0.0.22

10 Feb 16:51

  • Fix regression causing models with preserved BF16 tensors (notably multimodal models) to fail quantization

Full Changelog: v0.0.21...v0.0.22

0.0.21

09 Feb 21:21

  • Fix regression affecting Qwen3-Next
  • Avoid using the safetensors library during quantization (fixes occasional OoM errors)

Full Changelog: v0.0.20...v0.0.21