
Fix/context window explosion#243

Open
slysian wants to merge 6 commits into RightNow-AI:main from slysian:fix/context-window-explosion

Conversation

@slysian slysian commented Mar 3, 2026

Summary

Tool-heavy agent sessions (web_search + web_fetch) bloat to 100K+ tokens and cause 100+ second response times because defaults are too generous. This PR tightens 4 parameters as defense-in-depth:

  • web_fetch max_chars: 50,000 → 12,000 — 12K after HTML→markdown is still a full article
  • Context budget ratios: per-result cap 30%→8%, single-result max 50%→15%, total headroom 75%→40%
  • MAX_HISTORY_MESSAGES: 20 → 12 — 12 messages ≈ 3 tool iterations, sufficient for continuity
  • Auto-compaction trigger: threshold 30→15, keep_recent 10→6 — triggers LLM summarization earlier

Before vs After (200K context window)

| Metric | Before | After |
|---|---|---|
| Max per web_fetch result | 50K chars | 12K chars |
| Per-result budget cap | 120K chars | 32K chars |
| 5 tool iterations total | 250K+ chars | ~30K chars |
| Estimated tokens | 75-125K | 10-15K |
| Response time | 100+ seconds | 15-25 seconds |

Root cause

  1. web_fetch returns up to 50K chars per call
  2. Agent runs 5+ iterations of tool calls, each adding tool_use + tool_result messages
  3. MAX_HISTORY_MESSAGES=20 trims by count, but 20 messages × 50K tool results = 500K+ chars
  4. Context budget caps are too generous (per-result = 120K chars, headroom = 300K chars)
  5. Auto-compaction triggers at 30 messages, but token limits are hit well before that
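
The before/after character caps fall out of simple percentage arithmetic. A minimal sketch, assuming the budget base is the 200K-token window costed at a conservative 2 chars per token (the only base that reproduces the 120K/32K figures above; the real config names may differ):

```rust
// Budget base: 200K-token window at ~2 chars/token (assumed).
const WINDOW_CHARS: usize = 200_000 * 2;

// Convert a ratio (percent of window) into a character cap.
fn cap(percent: usize) -> usize {
    WINDOW_CHARS * percent / 100
}

fn main() {
    assert_eq!(cap(30), 120_000); // old per_result_cap
    assert_eq!(cap(8), 32_000);   // new per_result_cap
    assert_eq!(cap(50), 200_000); // old single_result_max
    assert_eq!(cap(15), 60_000);  // new single_result_max
    assert_eq!(cap(75), 300_000); // old total_headroom
    assert_eq!(cap(40), 160_000); // new total_headroom
    println!("ok");
}
```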

Test plan

  • All existing tests updated and passing
  • Built and deployed on production instance
  • Verified services start cleanly
  • Send bot a query triggering web_search + web_fetch, confirm response time < 30s
  • Verify `[TRUNCATED:` markers appear in logs (context budget working)
  • Verify compaction triggers at 15 messages (check for compaction log lines)

slysian and others added 6 commits March 2, 2026 13:54
Replace all unsafe byte-index string slicing (`&s[..N]`) with
`str::floor_char_boundary()` across 18 files in runtime, kernel, api,
memory, channels, and cli crates.

These cause panics when the byte index falls inside a multi-byte UTF-8
character (common with CJK text from QQ/Telegram users). The crash was
first observed when web_fetch returned Chinese web content that was
truncated at a 3-byte character boundary.

Affected hot paths:
- context_budget: tool result truncation
- context_overflow: overflow recovery truncation
- compactor: conversation text tail-keeping
- stream_chunker: forced break at max_chunk_chars
- web_fetch: HTTP response body truncation
- kernel: session topic & identity file truncation
- docker_sandbox: stdout/stderr truncation
- tool_runner: command/url logging, canvas_id slicing
- provider_health: error body truncation
- subprocess_sandbox: command logging
- session_repair: injection marker stripping (to_lowercase byte mismatch)
- cron/triggers: error message & content truncation
- session (memory): thinking text truncation
- TUI screens: ID/value display truncation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
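
The boundary panic described above can be reproduced and fixed in a few lines. The equivalent of `str::floor_char_boundary` is hand-rolled here so the sketch stands alone on any toolchain; the real patch calls the std method directly:

```rust
/// Truncate `s` to at most `max` bytes without splitting a UTF-8
/// character — a self-contained stand-in for `str::floor_char_boundary`.
fn truncate_at_char_boundary(s: &str, max: usize) -> &str {
    if s.len() <= max {
        return s;
    }
    // Walk the index down until it lands on a character boundary.
    let mut idx = max;
    while !s.is_char_boundary(idx) {
        idx -= 1;
    }
    &s[..idx]
}

fn main() {
    let cjk = "你好世界"; // each char is 3 bytes
    // A naive `&cjk[..4]` would panic: byte 4 falls inside the second char.
    assert_eq!(truncate_at_char_boundary(cjk, 4), "你");
    println!("ok");
}
```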
Create `supervisor.rs` with centralized reconnection/backoff infrastructure
and refactor 23 channel adapters to use it, eliminating ~10,000 lines of
duplicated reconnection logic.

The supervisor module provides:
- `SupervisorConfig`: configurable initial/max backoff (default 1s/60s)
- `run_supervised_loop()`: generic supervised reconnection loop
- `run_supervised_loop_reset_on_connect()`: variant that resets backoff
  after successful connection
- `DEFAULT_CHANNEL_BUFFER` (256): shared constant replacing hardcoded sizes

Each adapter's inline reconnection loop is extracted into a standalone
`async fn` that returns `Result<bool, String>`:
- `Ok(true)` = reconnect (transient failure)
- `Ok(false)` = permanent stop (shutdown or channel closed)
- `Err(msg)` = retry with backoff
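
The three-way contract above can be sketched as a plain loop (synchronous here for brevity; the function name follows the commit message, but the body is illustrative rather than the actual module):

```rust
use std::time::Duration;

/// Supervise a connect function until it asks to stop.
/// Returns the number of attempts made (handy for testing the sketch).
fn run_supervised_loop<F>(initial: Duration, max: Duration, mut connect: F) -> u32
where
    F: FnMut() -> Result<bool, String>,
{
    let mut backoff = initial;
    let mut attempts = 0u32;
    loop {
        attempts += 1;
        match connect() {
            // Transient failure handled inside the adapter:
            // reconnect right away and reset the backoff.
            Ok(true) => backoff = initial,
            // Shutdown signal or closed channel: stop supervising.
            Ok(false) => return attempts,
            // Hard error: double the backoff (a real supervisor
            // would sleep for `backoff` before retrying).
            Err(_msg) => backoff = (backoff * 2).min(max),
        }
    }
}

fn main() {
    let mut calls = 0;
    let attempts = run_supervised_loop(
        Duration::from_secs(1),
        Duration::from_secs(60),
        || {
            calls += 1;
            match calls {
                1 | 2 => Err("connection refused".into()), // retried
                3 => Ok(true),                             // reconnect
                _ => Ok(false),                            // shutdown
            }
        },
    );
    assert_eq!(attempts, 4);
    println!("attempts: {attempts}");
}
```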

Refactored adapters: telegram, discord, slack, irc, mattermost, revolt,
matrix, mastodon, zulip, bluesky, twitch, guilded, gotify, gitter,
discourse, ntfy, nostr, webex, twist, mumble, nextcloud, reddit,
keybase, linkedin.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ules

Break the single routes.rs god file containing 174 handler functions
into domain-specific modules under routes/:

  agents.rs (1,551 lines, 24 handlers)
  channels.rs (1,220 lines, 7 handlers)
  hands.rs (668 lines, 11 handlers)
  models.rs (554 lines, 8 handlers)
  skills.rs (489 lines, 9 handlers)
  sessions.rs (384 lines, 10 handlers)
  common.rs (330 lines, shared types/helpers)
  + 22 smaller domain modules

Backward compatibility is maintained via `pub use` re-exports in
mod.rs, so server.rs and test files continue referencing
`routes::handler_name` without changes.
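
The compatibility trick can be sketched with inline modules standing in for the routes/ files (module and handler names here are made up for illustration):

```rust
mod routes {
    pub mod agents {
        pub fn list_agents() -> &'static str { "agents" }
    }
    pub mod channels {
        pub fn list_channels() -> &'static str { "channels" }
    }
    // Glob re-exports in mod.rs keep `routes::handler_name`
    // resolving after the split into domain modules.
    pub use agents::*;
    pub use channels::*;
}

fn main() {
    // Old call sites compile unchanged:
    assert_eq!(routes::list_agents(), "agents");
    assert_eq!(routes::list_channels(), "channels");
    println!("ok");
}
```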

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace 12 instances of `id.to_string()[..8]` with
`id.to_string().get(..8).unwrap_or(&id_str)` across kernel.rs (3) and
channel_bridge.rs (9).

While UUID Display always produces 36-char ASCII strings making [..8]
currently safe, .get() is defensive against any future Display changes
and is consistent with the floor_char_boundary() hardening in the rest
of the codebase.
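
A minimal sketch of the defensive short-ID pattern (helper name is made up):

```rust
/// First 8 bytes of an ID if that slice is valid UTF-8,
/// otherwise the whole string — never a panic.
fn short_id(id: &str) -> &str {
    id.get(..8).unwrap_or(id)
}

fn main() {
    // UUIDs are 36 ASCII chars, so the fast path just slices:
    assert_eq!(short_id("550e8400-e29b-41d4-a716-446655440000"), "550e8400");
    // If byte 8 ever landed mid-character, `.get()` returns None
    // and we fall back to the whole string instead of panicking:
    assert_eq!(short_id("你好世界abc"), "你好世界abc");
    println!("ok");
}
```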

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace hardcoded Whisper model names and API endpoints with LazyLock
statics that read from environment variables at first use:

- GROQ_STT_MODEL (default: whisper-large-v3-turbo)
- GROQ_STT_URL (default: api.groq.com/openai/v1/audio/transcriptions)
- OPENAI_STT_MODEL (default: whisper-1)
- OPENAI_STT_URL (default: api.openai.com/v1/audio/transcriptions)

This allows users to swap STT models or providers without recompiling.
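
A sketch of the pattern for one of the four variables, assuming the usual `std::env::var` + `LazyLock` shape (Rust 1.80+); the actual code may differ:

```rust
use std::sync::LazyLock;

// Read once at first use; fall back to the documented default
// when the environment variable is unset.
static GROQ_STT_MODEL: LazyLock<String> = LazyLock::new(|| {
    std::env::var("GROQ_STT_MODEL")
        .unwrap_or_else(|_| "whisper-large-v3-turbo".to_string())
});

fn main() {
    println!("GROQ_STT_MODEL = {}", *GROQ_STT_MODEL);
}
```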

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tool-heavy agent sessions (web_search + web_fetch) bloat to 100K+ tokens
because defaults are too generous: web_fetch returns 50K chars per call,
per-result budget cap allows 120K chars (30% of 200K context), and
auto-compaction only triggers at 30 messages. With 5+ tool iterations,
total context easily exceeds the model's window, causing 100+ second
response times and downstream timeouts.

Four targeted changes (defense in depth):

1. web_fetch max_chars: 50,000 → 12,000
   12K chars after HTML→markdown is still a full article. The previous
   50K default was far more than any LLM can usefully process per result.

2. Context budget ratios tightened:
   - per_result_cap: 30% → 8% (120K → 32K chars on 200K window)
   - single_result_max: 50% → 15% (200K → 60K chars)
   - total_headroom: 75% → 40% (300K → 160K chars)

3. MAX_HISTORY_MESSAGES: 20 → 12
   12 messages ≈ 3 tool iterations or 6 user/assistant exchanges.
   Previous value of 20 allowed 500K+ chars of tool results to accumulate.

4. Auto-compaction trigger: threshold 30 → 15, keep_recent 10 → 6
   Triggers LLM-based summarization earlier, preserving key context
   instead of blindly dropping messages when the window overflows.
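
The trigger and keep-recent split in change 4 can be sketched as follows (assumed shape; the real compactor replaces the older messages with an LLM-written summary):

```rust
/// New trigger: compact once the session reaches 15 messages (was 30).
fn should_compact(message_count: usize) -> bool {
    const THRESHOLD: usize = 15;
    message_count >= THRESHOLD
}

/// Split history into (to_summarize, keep_verbatim), keeping the
/// 6 most recent messages (was 10) untouched.
fn split_for_compaction<T>(messages: Vec<T>) -> (Vec<T>, Vec<T>) {
    const KEEP_RECENT: usize = 6;
    let cut = messages.len().saturating_sub(KEEP_RECENT);
    let mut older = messages;
    let recent = older.split_off(cut);
    (older, recent)
}

fn main() {
    assert!(!should_compact(14));
    assert!(should_compact(15));
    let (older, recent) = split_for_compaction((0..20).collect::<Vec<_>>());
    assert_eq!(older.len(), 14); // summarized into one message
    assert_eq!(recent.len(), 6); // kept verbatim
    println!("ok");
}
```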

Expected impact: estimated tokens per session drop from 75-125K to
10-15K, response time from 100+ seconds to 15-25 seconds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>