Description
What would you like to be added:
With sliding window attention (SWA), vLLM does not keep KV-cache entries for the full prefix, so full-prefix matching overestimates what is actually cached on an instance. See:
https://docs.vllm.ai/en/latest/design/hybrid_kv_cache_manager.html#prefix-caching
While the full-prefix matching algorithm should still be an improvement over routing with no prefix awareness, an SWA-optimized algorithm that aligns with vLLM's SWA eviction should work better.
A small design doc should be presented to discuss the implementation and how it interacts with the existing full-prefix matching algorithm.
The key here is that the indexer needs to capture:
- num layers using full attention
- num layers using SWA
- sliding window (SW) size

With the above info, the indexer can simulate the cache eviction process so that it matches the inference engine's behavior as closely as possible.
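As a rough illustration of the idea (all names and the scoring scheme here are hypothetical, not the actual indexer API), a simulation could keep full-prefix matching for full-attention layers, but for SWA layers count only the matched blocks that still fall inside the sliding window and would therefore survive vLLM-style eviction:

```python
from dataclasses import dataclass


@dataclass
class ModelCacheConfig:
    # Hypothetical config; the real indexer's fields may differ.
    num_full_attn_layers: int  # layers using full attention
    num_swa_layers: int        # layers using sliding window attention
    window_size: int           # sliding window size, in tokens
    block_size: int            # KV-cache block size, in tokens


def swa_live_blocks(cfg: ModelCacheConfig, seq_len: int) -> set[int]:
    """Block indices an SWA layer would still hold after eviction:
    only blocks overlapping the last `window_size` tokens survive."""
    total_blocks = -(-seq_len // cfg.block_size)  # ceil division
    first_live = max(0, (seq_len - cfg.window_size) // cfg.block_size)
    return set(range(first_live, total_blocks))


def scored_prefix_overlap(cfg: ModelCacheConfig,
                          cached_seq_len: int,
                          prefix_len: int) -> float:
    """Score a candidate instance: full-attention layers are credited
    with the whole matched prefix, SWA layers only with the matched
    blocks still inside the window. Returns a layer-weighted average
    of cached blocks usable for this request."""
    match_blocks = min(prefix_len, cached_seq_len) // cfg.block_size
    full_hit = match_blocks
    swa_hit = len(swa_live_blocks(cfg, cached_seq_len)
                  & set(range(match_blocks)))
    total_layers = cfg.num_full_attn_layers + cfg.num_swa_layers
    return (cfg.num_full_attn_layers * full_hit
            + cfg.num_swa_layers * swa_hit) / total_layers
```

For example, with 8 full-attention layers, 24 SWA layers, a 1024-token window, and 16-token blocks, a fully matched 4096-token prefix scores far below the 256 blocks that pure full-prefix matching would report, because SWA layers retain only the last 64 blocks; and a match that lies entirely outside the window contributes nothing through the SWA layers at all.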
Why is this needed: