research(context): SideQuest model-driven KV cache management for long-horizon agentic tasks #1885

## Summary

Model-driven KV cache compression that uses the large reasoning model (LRM) itself, rather than attention heuristics, to identify and evict stale tool outputs during long agentic tasks.

**Source**: arXiv 2602.22603 — "SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning" (Kariyappa & Suh, NVIDIA, 2026-02-27)

## Technique

Every K turns, SideQuest forks generation into a side thread. That thread inspects the open tool outputs via cursor indices, reasons about which are now obsolete relative to the current step, and emits structured deletion commands (`{del_cursors: [0]}`). Eviction is deferred until the main thread's turn completes, so the two threads synchronize at the turn boundary. The trigger phrase "Memory management mode" keeps management tokens out of the primary context. Staleness is determined by semantic reasoning (a "last-use index"), not by attention scores.
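A minimal sketch of the command handling described above, assuming the side thread returns the literal text form `{del_cursors: [...]}` shown in the example. The names `parse_del_cursors` and `PendingEvictions` are hypothetical, and the hand-rolled parser is an illustration only; the paper's actual command grammar may differ.

```rust
use std::collections::BTreeSet;

/// Parse a deletion command of the form `{del_cursors: [0, 2]}`.
/// Returns `None` if the key or brackets are missing, or if any
/// entry is not a non-negative integer.
fn parse_del_cursors(cmd: &str) -> Option<Vec<usize>> {
    let rest = &cmd[cmd.find("del_cursors")?..];
    let open = rest.find('[')?;
    let close = open + rest[open..].find(']')?;
    let inner = rest[open + 1..close].trim();
    if inner.is_empty() {
        return Some(Vec::new());
    }
    inner
        .split(',')
        .map(|token| token.trim().parse::<usize>().ok())
        .collect()
}

/// Cursors the side thread has marked stale but that are not evicted yet.
#[derive(Default)]
struct PendingEvictions {
    cursors: BTreeSet<usize>,
}

impl PendingEvictions {
    /// Called when the side thread finishes; nothing is evicted here.
    fn queue(&mut self, cursors: &[usize]) {
        self.cursors.extend(cursors.iter().copied());
    }

    /// Called once the main thread's turn completes. Returns the
    /// deduplicated cursors in ascending order and clears the queue.
    fn drain_at_turn_boundary(&mut self) -> Vec<usize> {
        std::mem::take(&mut self.cursors).into_iter().collect()
    }
}
```

Queuing first and draining only at the turn boundary means the main thread never sees its context change mid-turn, which is the synchronization point the technique relies on.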

Central finding: static importance metrics (H2O, SnapKV, R-KV) fail on agentic tasks because tool output utility is non-monotonic: a result can be irrelevant for many steps and then become critical again. Model-driven reasoning can account for this because it judges whether an output will be needed later, not how recently or heavily it was attended to.
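The non-monotonic case can be illustrated with a toy example. The recency rule below is a deliberately simple stand-in for a static heuristic; it is not H2O, SnapKV, or R-KV, and both function names are hypothetical.

```rust
/// Toy recency heuristic: keep the `budget` most recently used outputs
/// and evict the rest. `last_used_turn[c]` is the last turn that read
/// the output with cursor `c`. Returns evicted cursors in ascending order.
fn evict_least_recent(last_used_turn: &[usize], budget: usize) -> Vec<usize> {
    let mut order: Vec<usize> = (0..last_used_turn.len()).collect();
    // Most recently used first; ties go to the higher (newer) cursor.
    order.sort_by(|&a, &b| last_used_turn[b].cmp(&last_used_turn[a]).then(b.cmp(&a)));
    let mut evicted: Vec<usize> = order.into_iter().skip(budget).collect();
    evicted.sort_unstable();
    evicted
}

/// Cursors that were evicted even though a later turn still needs them.
fn wrongly_evicted(evicted: &[usize], needed_later: &[usize]) -> Vec<usize> {
    evicted
        .iter()
        .copied()
        .filter(|cursor| needed_later.contains(cursor))
        .collect()
}
```

With `last_used_turn = [1, 4, 5]` and a budget of 2, the rule evicts cursor 0. If a later step needs cursor 0 again, the dormant output is already gone, which is the failure mode the finding describes.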

## Results (FRAMES + BrowseComp benchmarks)

- Peak token usage: **−56% to −65%**
- KV cache memory reads: **−53% to −71%**
- SGLang throughput: **+83.9%** (1523 vs 828 tok/s)
- End-to-end runtime: **−36.8%**
- Accuracy degradation: ≤2% in-distribution, ~5% out-of-distribution
- Non-completion rate: matches uncompressed baseline (heuristics show high non-completion at comparable budgets)

## Applicability to Zeph

HIGH. Zeph's existing compaction is oldest-first pruning in `zeph-memory`. SideQuest's cursor-based semantic eviction is a direct replacement for the pruning heuristic in the compaction layer. Implementation path:

1. Add cursor index tracking to tool output storage (extend `MessagePart::ToolResult` with a cursor index)
2. Introduce a side-thread compaction step in the agent loop (fires every K turns, generates deletion decisions via a short LLM call)
3. Evict at turn boundary before context assembly
4. No architectural changes to the agent loop, context builder, or memory are needed; the side-thread step extends the existing compaction trigger logic
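The steps above can be sketched as follows. Everything here is hypothetical: this `MessagePart` is a stand-in for Zeph's real type, and the `decide` closure stands in for the short LLM call, so the sketch shows only the control flow.

```rust
/// Hypothetical stand-in for Zeph's message part type (step 1 adds `cursor`).
#[derive(Debug, Clone, PartialEq)]
enum MessagePart {
    Text(String),
    ToolResult { cursor: usize, content: String },
}

/// Step 2: the side thread fires on every `interval_turns`-th turn (1-based).
fn sidequest_due(turn: usize, interval_turns: usize) -> bool {
    interval_turns > 0 && turn > 0 && turn % interval_turns == 0
}

/// Step 3: drop the tool results named by `stale` before context assembly.
fn evict_at_turn_boundary(context: &mut Vec<MessagePart>, stale: &[usize]) {
    context.retain(|part| match part {
        MessagePart::ToolResult { cursor, .. } => !stale.contains(cursor),
        MessagePart::Text(_) => true,
    });
}

/// Runs `turns` turns. Each turn appends one tool result; `decide` is a
/// placeholder for the LLM call that returns the cursors it considers stale.
fn run_turns<F>(context: &mut Vec<MessagePart>, turns: usize, interval_turns: usize, mut decide: F)
where
    F: FnMut(&[MessagePart]) -> Vec<usize>,
{
    // Continue numbering after the highest cursor already in context.
    let mut next_cursor = context
        .iter()
        .filter_map(|part| match part {
            MessagePart::ToolResult { cursor, .. } => Some(*cursor + 1),
            MessagePart::Text(_) => None,
        })
        .max()
        .unwrap_or(0);

    for turn in 1..=turns {
        context.push(MessagePart::ToolResult {
            cursor: next_cursor,
            content: format!("output of turn {turn}"),
        });
        next_cursor += 1;

        // The turn is complete here, so eviction happens at the boundary.
        if sidequest_due(turn, interval_turns) {
            let stale = decide(context.as_slice());
            evict_at_turn_boundary(context, &stale);
        }
    }
}
```

Cursors are assigned once and never reused in this sketch, so a deletion decision made by the side thread cannot accidentally target an output that was added after it ran.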

Complements and supersedes oldest-first pruning in `zeph-memory`; extends `#1824` (KVzip importance scoring) with semantic eviction. It does not conflict with `#1851` (SWE-Pruner / COMI MIG): the two can run together under different triggers.

## Implementation sketch

- New config section: `[memory.sidequest]` with keys `enabled = false` and `interval_turns = 4`
- LLM call budget: one short reasoning call per K turns (small relative to typical task depth)
- Cursor indexing: `tool_output_cursors: Vec<usize>` tracked in `ContextBuilder`
- Eviction primitive: `drop_tool_outputs_by_cursor(cursors: &[usize])` in `ContextBuilder`
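A rough sketch of how these pieces might fit together. The `ContextBuilder` below is a hypothetical, stripped-down stand-in for Zeph's real type. `SideQuestConfig`, `push_tool_output`, and the `tool_outputs` and `next_cursor` fields are assumptions added for illustration; only `tool_output_cursors` and `drop_tool_outputs_by_cursor` come from the list above.

```rust
/// Hypothetical mirror of the proposed `[memory.sidequest]` section.
#[derive(Debug, Clone, PartialEq)]
struct SideQuestConfig {
    enabled: bool,
    interval_turns: usize,
}

impl Default for SideQuestConfig {
    /// Defaults proposed in this issue: disabled, every 4 turns.
    fn default() -> Self {
        Self { enabled: false, interval_turns: 4 }
    }
}

/// Hypothetical, stripped-down `ContextBuilder`.
#[derive(Debug, Default)]
struct ContextBuilder {
    /// Cursors of outputs still in context, parallel to `tool_outputs`.
    tool_output_cursors: Vec<usize>,
    tool_outputs: Vec<String>,
    next_cursor: usize,
}

impl ContextBuilder {
    /// Stores a tool output and returns the cursor assigned to it.
    fn push_tool_output(&mut self, output: String) -> usize {
        let cursor = self.next_cursor;
        self.next_cursor += 1;
        self.tool_output_cursors.push(cursor);
        self.tool_outputs.push(output);
        cursor
    }

    /// Drops the outputs named by `cursors`; unknown cursors are ignored.
    /// Returns how many outputs were removed.
    fn drop_tool_outputs_by_cursor(&mut self, cursors: &[usize]) -> usize {
        let before = self.tool_outputs.len();
        let mut kept_cursors = Vec::with_capacity(before);
        let mut kept_outputs = Vec::with_capacity(before);
        let pairs = self.tool_output_cursors.drain(..).zip(self.tool_outputs.drain(..));
        for (cursor, output) in pairs {
            if !cursors.contains(&cursor) {
                kept_cursors.push(cursor);
                kept_outputs.push(output);
            }
        }
        self.tool_output_cursors = kept_cursors;
        self.tool_outputs = kept_outputs;
        before - self.tool_outputs.len()
    }
}
```

Ignoring unknown cursors is a design choice in this sketch: the deletion list comes from model output, so a stale or hallucinated index should be a no-op, not an error.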
