KV cache SQuat #1
base: main
Conversation
cc @MekkCyber

@MekkCyber Sorry to bother you --- just wondering if you could take a look at this PR when you have time. Happy to make any changes needed!

Sorry @haowang94, forgot about it! Will take a look asap.
MekkCyber left a comment
Hi @phymhan! The project looks great — thanks for sharing. That said, it adds quite a bit of complexity on the model side. Some of that logic might be better suited for a new cache implementation instead, like what’s done here:
https://github.com/huggingface/transformers/blob/main/src/transformers/cache_utils.py#L821
If you’re interested in exploring an integration into transformers, I’d be happy to help! The only limitation I see for now is that we probably can’t support calling kernels inside the model for on-the-fly unpacking & dequantization; we would dequantize on the cache side instead.
Hi @MekkCyber Definitely --- this sounds interesting to us. Thanks for the suggestion and willingness to help! We'll experiment with the current implementation of the quantized cache class and explore integrating this method as a new cache implementation. We'll keep you posted.

Sounds great! Very excited about this 🔥

Hey @MekkCyber, just a quick follow-up: we've made a PR here: huggingface/transformers#38055. Would love your thoughts when you get a chance!
Motivation
The KV cache stores the key and value tensors of previous tokens so they are not recomputed during autoregressive decoding. For long sequences, the KV cache can consume more GPU memory than the model weights themselves. During inference, LLM decoding becomes memory-bound, with most of the time spent on data transfer rather than computation. This has led to active research on KV cache quantization, but quantization errors can accumulate as more tokens are generated, causing later tokens to deviate from the expected outputs.
This PR
This PR adds SQuat (Subspace-orthogonal KV cache quantization), a state-of-the-art training-free KV cache quantization method. It can significantly reduce memory overhead and latency while maintaining model accuracy.
SQuat constructs a subspace that captures critical task-relevant information, then constrains the quantization error to lie orthogonal to this subspace, minimizing its effect on the output of the attention mechanism.
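To make the subspace-orthogonality idea concrete, here is a minimal conceptual sketch in PyTorch. It is not the PR's implementation: the toy uniform quantizer and the helper names (`round_to_grid`, `subspace_orthogonal_quantize`) are assumptions, and the actual method folds the orthogonality constraint into the quantizer itself rather than storing a floating-point correction as done here. The sketch only illustrates what "quantization error orthogonal to a subspace" means.

```python
import torch

def round_to_grid(x: torch.Tensor, num_bits: int = 2) -> torch.Tensor:
    """Toy uniform quantizer: round x onto a (2**num_bits)-level grid over its range."""
    levels = 2 ** num_bits - 1
    x_min, x_max = x.min(), x.max()
    scale = (x_max - x_min).clamp(min=1e-8) / levels
    return torch.round((x - x_min) / scale) * scale + x_min

def subspace_orthogonal_quantize(k: torch.Tensor, Q: torch.Tensor, num_bits: int = 2) -> torch.Tensor:
    """Quantize k, then remove the error component that falls inside span(Q).

    k: (d,) key vector; Q: (d, r) orthonormal basis of the task-relevant subspace.
    The residual (k - result) of the returned vector is orthogonal to span(Q).
    """
    k_hat = round_to_grid(k, num_bits)   # naive per-vector quantization
    err = k - k_hat                      # quantization error
    err_in_subspace = Q @ (Q.T @ err)    # projection of the error onto span(Q)
    # Adding this component back leaves only the orthogonal part of the error.
    return k_hat + err_in_subspace

# Tiny check: the remaining error has (near-)zero component inside the subspace.
d, r = 8, 2
Q, _ = torch.linalg.qr(torch.randn(d, r))   # random orthonormal basis, stand-in for the task-relevant subspace
k = torch.randn(d)
k_deq = subspace_orthogonal_quantize(k, Q)
print((Q.T @ (k - k_deq)).norm().item())    # ~0
```

The payoff of such a constraint: for any query direction q lying inside the subspace, q · (k - k_deq) = 0, so attention logits computed against the quantized keys match those of the full-precision keys along the directions the subspace treats as important.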
🌟 Highlights
⚡ Efficient
🏃🏻 Example
Run `example.py` or:
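A rough, hedged sketch of what such a run might look like, assuming `example.py` wraps a Hugging Face causal LM with this repo's SQuat-quantized KV cache. The model name and the commented-out `enable_squat_cache` helper below are illustrative placeholders, not the repo's actual API; see `example.py` for the real entry points.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; substitute the model used in example.py
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Hypothetical helper: patch the model so its KV cache is quantized with SQuat.
# The module and function names are placeholders for whatever example.py actually imports.
# from squat import enable_squat_cache
# model = enable_squat_cache(model, nbits=2, group_size=64)

prompt = "The key to efficient long-context inference is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```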