
research(context): SideQuest model-driven KV cache management for long-horizon agentic tasks #1885


Description


Summary

Model-driven KV cache compression that uses the LRM itself (not attention heuristics) to identify and evict stale tool outputs during long agentic tasks.

Source: arXiv 2602.22603 — "SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning" (Kariyappa & Suh, NVIDIA, 2026-02-27)

Technique

Every K turns, SideQuest forks generation into a side thread. That thread inspects the open tool outputs via their cursor indices, reasons about which have become obsolete relative to the current step, and emits structured deletion commands ({del_cursors: [0]}). Eviction is deferred until the main thread's turn completes (synchronization at the turn boundary). A trigger phrase ("Memory management mode") keeps the management tokens out of the primary context. Staleness is determined by semantic reasoning over a "last-use index", not by attention scores.
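The structured deletion command above is easy to parse mechanically. A minimal sketch, assuming the side thread's reply embeds a `{del_cursors: [...]}` fragment; a real implementation would use a proper JSON parser (e.g. serde_json), but this hand-rolled version keeps the example dependency-free:

```rust
// Sketch: extract cursor indices from a side-thread reply containing a
// deletion command such as `{del_cursors: [0, 3]}`.
// `parse_del_cursors` is an illustrative name, not part of SideQuest or Zeph.
fn parse_del_cursors(reply: &str) -> Vec<usize> {
    // Locate the command key; bail out with no deletions if absent.
    let Some(start) = reply.find("del_cursors") else { return Vec::new() };
    let rest = &reply[start..];
    // Take everything between the first `[` and `]` after the key.
    let (Some(open), Some(close)) = (rest.find('['), rest.find(']')) else {
        return Vec::new();
    };
    rest[open + 1..close]
        .split(',')
        .filter_map(|tok| tok.trim().parse::<usize>().ok())
        .collect()
}
```

Malformed or missing commands degrade to an empty list, i.e. no eviction — a safe default for a compaction step.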

Central finding: static importance metrics (H2O, SnapKV, R-KV) fail on agentic tasks because tool output utility is non-monotonic — a result can be irrelevant for many steps then become critical again. Model-driven reasoning handles this correctly.

Results (FRAMES + BrowseComp benchmarks)

  • Peak token usage: −56% to −65%
  • KV cache memory reads: −53% to −71%
  • SGLang throughput: +83.9% (1523 vs 828 tok/s)
  • End-to-end runtime: −36.8%
  • Accuracy degradation: ≤2% in-distribution, ~5% out-of-distribution
  • Non-completion rate: matches uncompressed baseline (heuristics show high non-completion at comparable budgets)

Applicability to Zeph

HIGH. Zeph's existing compaction is oldest-first pruning in zeph-memory. SideQuest's cursor-based semantic eviction is a direct replacement for the pruning heuristic in the compaction layer. Implementation path:

  1. Add cursor index tracking to tool output storage (extend MessagePart::ToolResult with a cursor index)
  2. Introduce a side-thread compaction step in the agent loop (fires every K turns, generates deletion decisions via a short LLM call)
  3. Evict at turn boundary before context assembly
  4. No architectural changes to agent loop, context builder, or memory — this is an extension of the existing compaction trigger logic
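Steps 2–3 can be sketched as a small compaction hook: a turn counter that fires every `interval_turns`, collects deletion decisions from a short side-thread call, and hands them back only at the turn boundary. All names here (`SideQuestCompactor`, `on_turn_end`) are hypothetical, not Zeph's actual API:

```rust
// Hypothetical compaction hook for the agent loop. Eviction decisions are
// made by a side-thread call (`decide`, standing in for the short LLM call)
// and applied only at the turn boundary, per the SideQuest design.
struct SideQuestCompactor {
    interval_turns: usize,       // K: how often the side thread fires
    turn: usize,                 // completed turns so far
    pending_evictions: Vec<usize>, // cursors awaiting turn-boundary eviction
}

impl SideQuestCompactor {
    fn new(interval_turns: usize) -> Self {
        Self { interval_turns, turn: 0, pending_evictions: Vec::new() }
    }

    /// Called after each agent turn. Returns the cursors to evict before
    /// the next context assembly (empty on non-trigger turns).
    fn on_turn_end(&mut self, decide: impl Fn() -> Vec<usize>) -> Vec<usize> {
        self.turn += 1;
        if self.turn % self.interval_turns == 0 {
            self.pending_evictions = decide();
        }
        // Drain pending decisions so each eviction is applied exactly once.
        std::mem::take(&mut self.pending_evictions)
    }
}
```

Because the hook only reads a turn counter and returns cursor indices, it slots into the existing compaction trigger without touching the agent loop's control flow.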

Complements and supersedes oldest-first pruning in zeph-memory; extends #1824 (KVzip importance scoring) with semantic eviction. Does not conflict with #1851 (SWE-Pruner / COMI MIG) — can run together with different triggers.

Implementation sketch

  • New config key: [memory.sidequest] enabled = false, interval_turns = 4
  • LLM call budget: one short reasoning call per K turns (negligible relative to typical task depth)
  • Cursor indexing: tool_output_cursors: Vec<usize> tracked in ContextBuilder
  • Eviction primitive: drop_tool_outputs_by_cursor(cursors: &[usize]) in ContextBuilder
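The eviction primitive itself is a one-liner over the tracked cursors. A minimal sketch, assuming tool outputs are stored as `(cursor, text)` pairs for illustration — the real `ContextBuilder` would hold full message parts:

```rust
// Simplified ContextBuilder holding only what the sketch needs: each tool
// output tagged with its cursor index (step 1 of the implementation path).
struct ContextBuilder {
    tool_outputs: Vec<(usize, String)>, // (cursor index, output text)
}

impl ContextBuilder {
    /// Drop every tool output whose cursor appears in `cursors`.
    fn drop_tool_outputs_by_cursor(&mut self, cursors: &[usize]) {
        self.tool_outputs.retain(|(c, _)| !cursors.contains(c));
    }
}
```

`retain` keeps the surviving outputs in order, so context assembly after eviction needs no re-sorting.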
