feat(attn): add switchable flash-attn and flashinfer backends#156

Open
lesj0610 wants to merge 1 commit into turboderp-org:master from lesj0610:feat/backend-core-v2

Conversation

Contributor

@lesj0610 lesj0610 commented Mar 2, 2026

Summary

  • add switchable attention backend policy support (auto, flash_attn, flashinfer, sdpa)
  • resolve the selected backend once in runtime/generator code and use it consistently for prefill/decode paths
  • keep recurrent prefill/checkpoint handling compatible with backend switching
  • limit this PR to backend/core wiring in attn.py, generator.py, job.py, config.py, and transformer.py
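The resolve-once policy described above could look like the following sketch. All names here (`resolve_backend`, `SUPPORTED`, `_available`) are illustrative assumptions, not the PR's actual API in attn.py; it only shows the general shape of selecting a backend once and reusing the result for prefill and decode.

```python
# Hypothetical sketch of a resolve-once attention backend policy.
# Function and constant names are illustrative, not from the PR.
import importlib.util

SUPPORTED = ("flash_attn", "flashinfer", "sdpa")

def _available(backend: str) -> bool:
    # sdpa is built into torch, so treat it as always available here;
    # the optional backends are probed by module presence only.
    if backend == "sdpa":
        return True
    return importlib.util.find_spec(backend) is not None

def resolve_backend(policy: str = "auto") -> str:
    """Resolve the backend once; callers cache the result and use it
    consistently for both prefill and decode paths."""
    if policy == "auto":
        for candidate in SUPPORTED:
            if _available(candidate):
                return candidate
        raise RuntimeError("no attention backend available")
    if policy not in SUPPORTED:
        raise ValueError(f"unknown attention backend: {policy}")
    if not _available(policy):
        raise RuntimeError(f"requested backend not installed: {policy}")
    return policy
```

Resolving once up front, rather than per call, is what keeps recurrent prefill/checkpoint handling compatible with switching: every path in a generation job sees the same backend decision.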

Validation

  • smoke-tested loading a Qwen3.5 abliterated EXL3 model on the refreshed branch
  • a short English generation smoke test completed successfully

@lesj0610 lesj0610 force-pushed the feat/backend-core-v2 branch from 18bab8b to 97a1e60 Compare March 2, 2026 10:15
@lesj0610 lesj0610 force-pushed the feat/backend-core-v2 branch from 153d6e0 to a9c593d Compare March 11, 2026 15:52
@lesj0610
Contributor Author

Refreshed onto v0.0.24. This branch now only carries the backend/core runtime wiring needed for switchable attention backends.
