Description
What would you like to be added:
With sliding window attention (SWA), vLLM does not keep KV-cache entries for the full prefix, so full-prefix matching overestimates what is actually cached on an instance. See:
https://docs.vllm.ai/en/latest/design/hybrid_kv_cache_manager.html#prefix-caching
While the full-prefix matching algorithm should still be an improvement over routing with no prefix awareness, an SWA-optimized algorithm that aligns with vLLM's SWA eviction should work better.
A small design doc should be presented to discuss the implementation and how it interacts with the existing full-prefix matching algorithm.
The key here is that the indexer needs to capture:
- num layers using full attention
- num layers using SWA
- sliding window (SW) size

With the above info, the indexer can simulate the cache eviction process so that it matches the inference engine's behavior as closely as possible.
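As a rough illustration of the idea (all names and the scoring scheme here are hypothetical, not the actual indexer API), a simulation could keep full-prefix matching for full-attention layers, but for SWA layers count only the matched blocks that still fall inside the sliding window and would therefore survive vLLM-style eviction:

```python
from dataclasses import dataclass


@dataclass
class ModelCacheConfig:
    # Hypothetical config; the real indexer's fields may differ.
    num_full_attn_layers: int  # layers using full attention
    num_swa_layers: int        # layers using sliding window attention
    window_size: int           # sliding window size, in tokens
    block_size: int            # KV-cache block size, in tokens


def swa_live_blocks(cfg: ModelCacheConfig, seq_len: int) -> set[int]:
    """Block indices an SWA layer would still hold after eviction:
    only blocks overlapping the last `window_size` tokens survive."""
    total_blocks = -(-seq_len // cfg.block_size)  # ceil division
    first_live = max(0, (seq_len - cfg.window_size) // cfg.block_size)
    return set(range(first_live, total_blocks))


def scored_prefix_overlap(cfg: ModelCacheConfig,
                          cached_seq_len: int,
                          prefix_len: int) -> float:
    """Score a candidate instance: full-attention layers are credited
    with the whole matched prefix, SWA layers only with the matched
    blocks still inside the window. Returns a layer-weighted average
    of cached blocks usable for this request."""
    match_blocks = min(prefix_len, cached_seq_len) // cfg.block_size
    full_hit = match_blocks
    swa_hit = len(swa_live_blocks(cfg, cached_seq_len)
                  & set(range(match_blocks)))
    total_layers = cfg.num_full_attn_layers + cfg.num_swa_layers
    return (cfg.num_full_attn_layers * full_hit
            + cfg.num_swa_layers * swa_hit) / total_layers
```

For example, with 8 full-attention layers, 24 SWA layers, a 1024-token window, and 16-token blocks, a fully matched 4096-token prefix scores far below the 256 blocks that pure full-prefix matching would report, because SWA layers retain only the last 64 blocks; and a match that lies entirely outside the window contributes nothing through the SWA layers at all.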
Why is this needed: