Releases: turboderp-org/exllamav3
0.0.30
- Less memory overhead for uncached attn with large head sizes
- Switchable, uncached sliding-window mode with checkpoints to reduce cache size in Gemma4 and Step3.5
- More accurate VRAM estimation for autosplit loader
- Reduced K/V quantization overhead
- EXL3 GEMM kernel optimizations
- AVX-512 support (TP all-reduce)
- Various bugfixes
Full Changelog: v0.0.29...v0.0.30
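The sliding-window cache mode above bounds K/V memory by keeping only the most recent positions. A minimal generic sketch of that idea (not exllamav3's implementation, which operates on GPU tensors and adds checkpoints so prefixes can still be reused):

```python
from collections import deque

class SlidingWindowKVCache:
    """Toy K/V cache that retains only the last `window` positions.

    Illustrative only: names and structure are hypothetical, and the real
    kernels work on packed tensors rather than Python containers.
    """

    def __init__(self, window: int):
        self.window = window
        # deque with maxlen evicts the oldest entry automatically,
        # so memory stays bounded by `window` regardless of sequence length
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def __len__(self):
        return len(self.keys)

cache = SlidingWindowKVCache(window=4)
for pos in range(10):
    cache.append(f"k{pos}", f"v{pos}")

print(len(cache))        # 4
print(list(cache.keys))  # ['k6', 'k7', 'k8', 'k9']
```

The trade-off is that positions older than the window are gone; checkpoints (as mentioned in the release note) are one way to avoid recomputing everything when a cached prefix is revisited.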
0.0.29
- Support Gemma4ForConditionalGeneration
- Fix bug causing quantizer to allocate too much system RAM on resume
- Fix bug causing potential segfaults when saving large tensors
- Add loop detection option
- More benchmarks
- QoL improvements
- Other bugfixes
- Add Torch 2.11 wheels
- Add Python 3.14 wheels (Torch 2.9+ only)
Full Changelog: v0.0.28...v0.0.29
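The loop detection option added in 0.0.29 targets degenerate generations that repeat the same short token sequence indefinitely. A generic sketch of tail n-gram loop detection (parameter names are illustrative, not exllamav3's actual option names):

```python
def detect_loop(tokens, max_ngram=8, min_repeats=3):
    """Return True if the tail of `tokens` is the same n-gram repeated
    at least `min_repeats` times in a row, for any n up to `max_ngram`.
    A common heuristic for catching generation loops; not the library's
    actual algorithm.
    """
    for n in range(1, max_ngram + 1):
        needed = n * min_repeats
        if len(tokens) < needed:
            break
        tail = tokens[-needed:]
        ngram = tail[-n:]
        # check that every n-sized slice of the tail equals the last n-gram
        if all(tail[i:i + n] == ngram for i in range(0, needed, n)):
            return True
    return False

print(detect_loop([1, 2, 3, 4, 3, 4, 3, 4]))  # True: "3 4" repeats 3 times
print(detect_loop([1, 2, 3, 4, 5, 6, 7, 8]))  # False
```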
0.0.28
- Fix regression breaking inference for GLM4.5-Air and related models
Full Changelog: v0.0.27...v0.0.28
0.0.27
- New and more robust allocation strategy for non-integer bitrates
- Added `-hq` argument to quantizer (explanation here)
- Fix bug causing prompt caching to fail on recurrent models for certain combinations of prompt length and chunk size
- Fix broken output when using repetition penalties without decay range (affecting some OAI clients via TabbyAPI)
- Fix issue allowing recurrent state to fall out of sync with K/V cache
- Support more features in Nanochat, for some reason
- Other fixes and QoL improvements
Full Changelog: v0.0.26...v0.0.27
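The repetition-penalty fix above concerns the case where no decay range is set, i.e. the full penalty applies uniformly across the whole history. A generic sketch of a decaying repetition penalty (names and exact weighting are hypothetical, not exllamav3's implementation):

```python
def repetition_penalties(logits, history, penalty=1.15, decay_range=0):
    """Penalize logits of tokens seen in `history`, optionally fading the
    penalty linearly to zero over the most recent `decay_range` tokens.
    With decay_range == 0 the full penalty applies to every token in the
    history, which is the path the 0.0.27 fix concerns. Illustrative only.
    """
    out = list(logits)
    # record each token's most recent age (0 = last generated token)
    recent_age = {}
    for age, tok in enumerate(reversed(history)):
        recent_age.setdefault(tok, age)
    for tok, age in recent_age.items():
        if decay_range > 0:
            w = max(0.0, 1.0 - age / decay_range)  # older tokens fade out
        else:
            w = 1.0  # no decay: constant penalty over the whole history
        p = 1.0 + (penalty - 1.0) * w
        # standard convention: divide positive logits, multiply negative ones
        out[tok] = out[tok] / p if out[tok] > 0 else out[tok] * p
    return out

print(repetition_penalties([1.0, 2.0, -1.0, 0.5], [1, 2], penalty=2.0))
# [1.0, 1.0, -2.0, 0.5]
```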
0.0.26
- Fused expert kernel for improved prompt and batch throughput on MoE models
- Support OlmoHybridForCausalLM
- Fix non-integer bitrates when quantizing models with very large MLP layers
- Minor bugfixes
- QoL improvements
Full Changelog: v0.0.25...v0.0.26
0.0.25
- Add Qwen3_5ForCausalLM and Qwen3_5MoeForCausalLM
- Support Qwen3.5 finetunes saved entirely in BF16 format
- Correct tensor format for Qwen3.5 models with split experts (support REAPed models)
Full Changelog: v0.0.24...v0.0.25
0.0.24
- Faster MoE routing with graphs
- Fix regression breaking GLM 4.7
Full Changelog: v0.0.23...v0.0.24
0.0.23
- Support Qwen 3.5 (Qwen3_5ForConditionalGeneration, Qwen3_5MoeForConditionalGeneration)
- Support Step 3.5 (Step3p5ForCausalLM)
- Enable tensor-parallel support for Minimax-M2
- Switch quantizer to use out_scales by default
- Include Torch 2.10 wheels
- Various bugfixes, optimizations and QoL improvements
Full Changelog: v0.0.22...v0.0.23
0.0.22
- Fix regression causing models with preserved bf16 tensors (multimodal specifically) to fail quantization
Full Changelog: v0.0.21...v0.0.22
0.0.21
- Fix regression affecting Qwen3-Next
- Avoid using the `safetensors` library during quantization (fixes occasional OoM errors)
Full Changelog: v0.0.20...v0.0.21