StrataCache is a standalone tiered storage library for inference workloads. It provides a unified storage API over multiple memory/media layers (CPU memory, CXL, and future tiers) while keeping model/runtime-specific logic in adapters.
The project is designed for two concurrent goals:
- Reuse and manage KV cache for inference engines such as vLLM.
- Reuse the same storage plane for non-KV artifacts such as model parameter chunks used by offload/prefetch flows.
In short, StrataCache is not “KV-only cache code”; it is a generalized artifact storage substrate for LLM systems.
Modern inference systems move data across heterogeneous memory/media (GPU, CPU, CXL, SSD). Most implementations solve this in a task-specific way (for example KV-only), which creates duplicated logic and inconsistent policies.
StrataCache addresses this by:
- Defining a generalized artifact abstraction (
ArtifactId, metadata, bytes payload). - Providing a single tier coordinator (
TierChain) with write-through/write-back semantics. - Exposing one public API (
StorageEngine) used by both adapters and direct clients.
This allows KV caching and parameter offload/prefetch to share the same storage management model without key collisions or policy conflicts.
- vLLM KV cache reuse
- Use
StrataCacheConnectorV1to store/load KV chunks by prefix and chunk boundaries.
- Parameter offload and prefetch
- Use
ParameterStoreClienton top ofStorageEngineto store/load parameter chunks.
- Mixed workloads in one storage plane
- Run KV connector and parameter offload together with unified tier policies.
From the stratacache/ directory:
python -m pip install -U uv
uv venv -p 3.12 .venv
source .venv/bin/activate
uv pip install -e .vllm==0.13.0 is already included in pyproject.toml dependencies.
vllm serve Qwen/Qwen2.5-7B-Instruct \
--port 8000 \
--kv-transfer-config '{"kv_connector":"StrataCacheConnectorV1","kv_connector_module_path":"stratacache.adapters.vllm.connector_v1","kv_role":"kv_both"}'StrataCache connector reads YAML configuration.
- Default config file in this repo:
config.yaml - Packaged fallback config for installed module:
src/stratacache/config.yaml - Optional override path in vLLM config:
kv_connector_extra_config["stratacache.config_path"]
If no YAML file is found, built-in defaults are used.
| Key | Type | Default | Meaning |
|---|---|---|---|
use_cxl |
bool | false |
Enable CXL tier |
writeback |
bool | false |
Use write-back between CPU and CXL |
cpu_capacity_mb |
int | 61440 |
CPU tier capacity in MB |
chunk_size |
int | 256 |
Token chunk size for KV matching/store/load |
bundle_layers |
bool | true |
Store/load layered KV in bundled format |
tensor_codec |
str | stable |
Tensor payload codec |
tensor_header_in_payload |
bool | false |
Keep dtype/shape in payload header |
save_partial_chunks |
bool | true |
Save/load last partial chunk |
log_stats |
bool | true |
Enable connector stats logging |
log_every |
int | 50 |
Stats logging interval |
log_min_interval_s |
float | 2.0 |
Min seconds between logs |
debug |
bool | false |
Debug logging |
cxl_dax_device |
str/null | null |
CXL DAX device path |
cxl_reset_metadata |
bool | false |
Reset CXL metadata on init |
exporter.file.enabled |
bool | false |
Enable telemetry export to file |
exporter.wandb.enabled |
bool | false |
Enable telemetry export to WandB |
StrataCache separates storage responsibilities from task semantics.
- Artifact model (
core/)
- Generic object identity + metadata + payload.
- Memory layers (
backend/)
- Per-medium implementation (
CpuMemoryLayer,CxlMemoryLayer).
- Tier coordinator (
tiering/TierChain)
- Ordered tiers, read-through lookup, promotion, write-through/write-back propagation.
- Public API (
engine/StorageEngine)
store/load/contains/deletewithchain/exact/preferaccess modes.
- Task adapters (
adapters/)
- vLLM KV connector for prefix/chunk/slot mapping logic.
- Torch parameter client for parameter chunk workflows.
Write path:
- Client calls
StorageEngine.store(...). TierChainwrites to selected tier and applies link policy.- In write-back mode, dirty tracking and background convergence are handled internally.
Read path:
- Client calls
StorageEngine.load(...). TierChainscans tiers top-down (chain) or uses directed tier strategy (exact/prefer).- On lower-tier hit, optional promotion updates upper tiers.
KV-specific matching and token-boundary logic are adapter responsibilities, not storage-core responsibilities.
stratacache/
├── config.yaml
├── pyproject.toml
├── README.md
├── src/stratacache/
│ ├── core/ # artifact schema, codecs, errors, key builder
│ ├── backend/ # CPU/CXL memory layers
│ ├── tiering/ # TierChain and link policies
│ ├── writeback/ # dirty tracking and worker
│ ├── engine/ # StorageEngine public facade
│ └── adapters/
│ ├── vllm/ # KVConnector v1 integration
│ └── torch/ # parameter chunk client
└── tests/ # assert-based tests
PYTHONPATH=$(pwd)/src python tests/run_all.py