Description
Summary
Model-driven KV cache compression that uses the large reasoning model (LRM) itself, rather than attention heuristics, to identify and evict stale tool outputs during long agentic tasks.
Source: arXiv 2602.22603 — "SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning" (Kariyappa & Suh, NVIDIA, 2026-02-27)
Technique
Every K turns, SideQuest forks generation into a side thread. That thread inspects the open tool outputs via cursor indices, reasons about which are now obsolete relative to the current step, and emits structured deletion commands ({del_cursors: [0]}). Eviction is deferred until the main thread's turn completes, so the two threads synchronize at the turn boundary. The side thread is entered with the trigger phrase "Memory management mode", and its management tokens stay out of the primary context. Staleness is determined by semantic reasoning (a "last-use index"), not by attention scores.
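The deferred-eviction flow described above can be sketched in Rust. This is a minimal illustration, not the paper's implementation: `ToolOutputCache`, its methods, and the 1-based turn counting are all assumptions made for the example, and the side thread's LLM reasoning is reduced to a list of cursors passed in by the caller.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical model of open tool outputs addressed by cursor index.
#[derive(Debug, Default)]
pub struct ToolOutputCache {
    open: BTreeMap<usize, String>, // cursor -> tool output text
    next_cursor: usize,
    pending: BTreeSet<usize>, // deletions requested by the side thread
}

impl ToolOutputCache {
    /// Register a new tool output and return its cursor.
    pub fn push(&mut self, output: impl Into<String>) -> usize {
        let cursor = self.next_cursor;
        self.open.insert(cursor, output.into());
        self.next_cursor += 1;
        cursor
    }

    /// The side thread runs only every `k` turns (turns counted from 1).
    pub fn should_manage(turn: usize, k: usize) -> bool {
        k > 0 && turn > 0 && turn % k == 0
    }

    /// Record a deletion command such as {del_cursors: [0]}.
    /// Nothing is evicted yet; unknown cursors are ignored.
    pub fn request_deletion(&mut self, del_cursors: &[usize]) {
        for c in del_cursors {
            if self.open.contains_key(c) {
                self.pending.insert(*c);
            }
        }
    }

    /// Apply deferred deletions at the turn boundary; return evicted cursors.
    pub fn sync_at_turn_boundary(&mut self) -> Vec<usize> {
        let evicted: Vec<usize> = std::mem::take(&mut self.pending).into_iter().collect();
        for c in &evicted {
            self.open.remove(c);
        }
        evicted
    }

    /// Cursors of the tool outputs that are still open, in ascending order.
    pub fn open_cursors(&self) -> Vec<usize> {
        self.open.keys().copied().collect()
    }
}
```

Keeping the requested deletions in a separate `pending` set is what makes the eviction deferred: the main thread continues to see every open output until `sync_at_turn_boundary` runs.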
Central finding: static importance metrics (H2O, SnapKV, R-KV) fail on agentic tasks because tool output utility is non-monotonic: a result can be irrelevant for many steps and then become critical again. Model-driven reasoning can account for this because it judges each output against the current step rather than against a fixed score.
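A toy Rust sketch can make the non-monotonic failure mode concrete. It uses oldest-first pruning as a stand-in for a static policy; it does not model the attention-based metrics named above, and both function names and the `(turn, cursor)` trace format are assumptions made for illustration.

```rust
/// Oldest-first pruning: keep only the `budget` newest cursors.
/// `open` is assumed to be sorted from oldest to newest.
pub fn oldest_first_evict(open: &[usize], budget: usize) -> Vec<usize> {
    let cut = open.len().saturating_sub(budget);
    open[..cut].to_vec()
}

/// Cursors that a policy evicted even though a later turn still needs them.
/// `future_uses` is a hypothetical trace of (turn, cursor) references.
pub fn wrongly_evicted(
    evicted: &[usize],
    future_uses: &[(usize, usize)],
    now: usize,
) -> Vec<usize> {
    evicted
        .iter()
        .copied()
        .filter(|c| {
            future_uses
                .iter()
                .any(|&(turn, cursor)| cursor == *c && turn > now)
        })
        .collect()
}
```

With four open outputs and a budget of two, the static policy drops cursors 0 and 1. If cursor 0 is referenced again at turn 9, that eviction was a mistake that a fixed ranking could not have anticipated.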
Results (FRAMES + BrowseComp benchmarks)
- Peak token usage: −56% to −65%
- KV cache memory reads: −53% to −71%
- SGLang throughput: +83.9% (1523 vs 828 tok/s)
- End-to-end runtime: −36.8%
- Accuracy degradation: ≤2% in-distribution, ~5% out-of-distribution
- Non-completion rate: matches uncompressed baseline (heuristics show high non-completion at comparable budgets)
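As a quick consistency check, the reported throughput gain follows directly from the two tok/s figures quoted above:

```latex
\[
\frac{1523 - 828}{828} = \frac{695}{828} \approx 0.839
\quad\Longrightarrow\quad +83.9\%
\]
```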
Applicability to Zeph
HIGH. Zeph's existing compaction is oldest-first pruning in zeph-memory. SideQuest's cursor-based semantic eviction is a direct replacement for the pruning heuristic in the compaction layer. Implementation path:
- Add cursor index tracking to tool output storage (extend `MessagePart::ToolResult` with a cursor index)
- Introduce a side-thread compaction step in the agent loop (fires every K turns, generates deletion decisions via a short LLM call)
- Evict at turn boundary before context assembly
- No architectural changes to the agent loop, context builder, or memory: the new step is an extension of the existing compaction trigger logic
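The implementation path above might look roughly like the following Rust sketch. The `MessagePart` enum here is a simplified stand-in, not Zeph's actual type, and `compact_at_turn_boundary` is a hypothetical name; the `decide` closure stands in for the short LLM call that produces deletion decisions.

```rust
/// Hypothetical stand-in for Zeph's message part type; the real
/// `MessagePart::ToolResult` carries fields that are not shown here.
#[derive(Debug, Clone, PartialEq)]
pub enum MessagePart {
    Text(String),
    ToolResult { cursor: usize, output: String },
}

/// Runs the compaction step at the end of `turn`, before context assembly.
/// `decide` receives (cursor, output) pairs and returns the cursors to delete.
/// Returns the number of tool outputs evicted.
pub fn compact_at_turn_boundary<F>(
    parts: &mut Vec<MessagePart>,
    turn: usize,
    interval_turns: usize,
    decide: F,
) -> usize
where
    F: FnOnce(&[(usize, &str)]) -> Vec<usize>,
{
    // Fire only every `interval_turns` turns (turns counted from 1).
    if interval_turns == 0 || turn == 0 || turn % interval_turns != 0 {
        return 0;
    }
    // Collect the open tool outputs the side thread is allowed to inspect.
    let open: Vec<(usize, &str)> = parts
        .iter()
        .filter_map(|p| match p {
            MessagePart::ToolResult { cursor, output } => Some((*cursor, output.as_str())),
            _ => None,
        })
        .collect();
    let to_delete = decide(&open);

    // Evict the selected tool results; all other parts are left untouched.
    let before = parts.len();
    parts.retain(|p| {
        !matches!(p, MessagePart::ToolResult { cursor, .. } if to_delete.contains(cursor))
    });
    before - parts.len()
}
```

Passing the decision logic in as a closure keeps the sketch testable without a model; in a real integration the closure body would be the side-thread LLM call.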
Complements and supersedes oldest-first pruning in zeph-memory; extends #1824 (KVzip importance scoring) with semantic eviction. Does not conflict with #1851 (SWE-Pruner / COMI MIG); the two can run together with different triggers.
Implementation sketch
- New config key: `[memory.sidequest] enabled = false, interval_turns = 4`
- LLM call budget: one short reasoning call per K turns (trivial vs. typical task depth)
- Cursor indexing: `tool_output_cursors: Vec<usize>` tracked in `ContextBuilder`
- Eviction primitive: `drop_tool_outputs_by_cursor(cursors: &[usize])` in `ContextBuilder`
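The items in this sketch could be prototyped as follows. Only the config keys, `tool_output_cursors`, and `drop_tool_outputs_by_cursor` come from the list above; the `SideQuestConfig` struct, the parallel `tool_outputs` vector, `push_tool_output`, and the return value of the eviction primitive are assumptions, and the real `ContextBuilder` has other state not shown.

```rust
/// Hypothetical mirror of the proposed `[memory.sidequest]` config table.
#[derive(Debug, Clone, PartialEq)]
pub struct SideQuestConfig {
    pub enabled: bool,
    pub interval_turns: usize,
}

impl Default for SideQuestConfig {
    fn default() -> Self {
        // Defaults taken from the proposed config key.
        Self { enabled: false, interval_turns: 4 }
    }
}

/// Hypothetical slice of `ContextBuilder` holding only the cursor state.
#[derive(Debug, Default)]
pub struct ContextBuilder {
    tool_outputs: Vec<String>,
    tool_output_cursors: Vec<usize>, // parallel to `tool_outputs`
    next_cursor: usize,
}

impl ContextBuilder {
    /// Store a tool output and return the cursor assigned to it.
    pub fn push_tool_output(&mut self, output: String) -> usize {
        let cursor = self.next_cursor;
        self.tool_outputs.push(output);
        self.tool_output_cursors.push(cursor);
        self.next_cursor += 1;
        cursor
    }

    /// Remove every tool output whose cursor is listed; unknown cursors are
    /// ignored. Returns how many outputs were dropped.
    pub fn drop_tool_outputs_by_cursor(&mut self, cursors: &[usize]) -> usize {
        let before = self.tool_output_cursors.len();
        let outputs = std::mem::take(&mut self.tool_outputs);
        let old_cursors = std::mem::take(&mut self.tool_output_cursors);
        for (cursor, output) in old_cursors.into_iter().zip(outputs) {
            if !cursors.contains(&cursor) {
                self.tool_output_cursors.push(cursor);
                self.tool_outputs.push(output);
            }
        }
        before - self.tool_output_cursors.len()
    }

    pub fn tool_output_cursors(&self) -> &[usize] {
        &self.tool_output_cursors
    }

    pub fn tool_outputs(&self) -> &[String] {
        &self.tool_outputs
    }
}
```

In this sketch cursors are never renumbered after an eviction, so a cursor the model saw in an earlier management pass keeps referring to the same output. Whether Zeph would want stable or compacted cursors is an open design choice.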