
Fix/context window explosion#243

Open
slysian wants to merge 6 commits into RightNow-AI:main from slysian:fix/context-window-explosion

Conversation

@slysian slysian commented Mar 3, 2026

Summary

Tool-heavy agent sessions (web_search + web_fetch) bloat to 100K+ tokens and cause 100+ second response times because defaults are too generous. This PR tightens 4 parameters as defense-in-depth:

  • web_fetch max_chars: 50,000 → 12,000 — 12K after HTML→markdown is still a full article
  • Context budget ratios: per-result cap 30%→8%, single-result max 50%→15%, total headroom 75%→40%
  • MAX_HISTORY_MESSAGES: 20 → 12 — 12 messages ≈ 3 tool iterations, sufficient for continuity
  • Auto-compaction trigger: threshold 30→15, keep_recent 10→6 — triggers LLM summarization earlier

Before vs After (200K context window)

| Metric | Before | After |
|---|---|---|
| Max per web_fetch result | 50K chars | 12K chars |
| Per-result budget cap | 120K chars | 32K chars |
| 5 tool iterations total | 250K+ chars | ~30K chars |
| Estimated tokens | 75-125K | 10-15K |
| Response time | 100+ seconds | 15-25 seconds |

Root cause

  1. web_fetch returns up to 50K chars per call
  2. Agent runs 5+ iterations of tool calls, each adding tool_use + tool_result messages
  3. MAX_HISTORY_MESSAGES=20 trims by count, but 20 messages × 50K tool results = 500K+ chars
  4. Context budget caps are too generous (per-result = 120K chars, headroom = 300K chars)
  5. Auto-compaction triggers at 30 messages, but token limits are hit well before that
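
The before/after character caps fall out of simple percentage arithmetic. A minimal sketch, assuming the budget base is the 200K-token window costed at a conservative 2 chars per token (the only base that reproduces the 120K/32K figures above; the real config names may differ):

```rust
// Budget base: 200K-token window at ~2 chars/token (assumed).
const WINDOW_CHARS: usize = 200_000 * 2;

// Convert a ratio (percent of window) into a character cap.
fn cap(percent: usize) -> usize {
    WINDOW_CHARS * percent / 100
}

fn main() {
    assert_eq!(cap(30), 120_000); // old per_result_cap
    assert_eq!(cap(8), 32_000);   // new per_result_cap
    assert_eq!(cap(50), 200_000); // old single_result_max
    assert_eq!(cap(15), 60_000);  // new single_result_max
    assert_eq!(cap(75), 300_000); // old total_headroom
    assert_eq!(cap(40), 160_000); // new total_headroom
    println!("ok");
}
```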

Test plan

  • All existing tests updated and passing
  • Built and deployed on production instance
  • Verified services start cleanly
  • Send bot a query triggering web_search + web_fetch, confirm response time < 30s
  • Verify `[TRUNCATED:` markers appear in logs (context budget working)
  • Verify compaction triggers at 15 messages (check for compaction log lines)

slysian and others added 6 commits March 2, 2026 13:54
Replace all unsafe byte-index string slicing (`&s[..N]`) with
`str::floor_char_boundary()` across 18 files in runtime, kernel, api,
memory, channels, and cli crates.

These cause panics when the byte index falls inside a multi-byte UTF-8
character (common with CJK text from QQ/Telegram users). The crash was
first observed when web_fetch returned Chinese web content that was
truncated at a 3-byte character boundary.

Affected hot paths:
- context_budget: tool result truncation
- context_overflow: overflow recovery truncation
- compactor: conversation text tail-keeping
- stream_chunker: forced break at max_chunk_chars
- web_fetch: HTTP response body truncation
- kernel: session topic & identity file truncation
- docker_sandbox: stdout/stderr truncation
- tool_runner: command/url logging, canvas_id slicing
- provider_health: error body truncation
- subprocess_sandbox: command logging
- session_repair: injection marker stripping (to_lowercase byte mismatch)
- cron/triggers: error message & content truncation
- session (memory): thinking text truncation
- TUI screens: ID/value display truncation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
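
The boundary panic described above can be reproduced and fixed in a few lines. The equivalent of `str::floor_char_boundary` is hand-rolled here so the sketch stands alone on any toolchain; the real patch calls the std method directly:

```rust
/// Truncate `s` to at most `max` bytes without splitting a UTF-8
/// character — a self-contained stand-in for `str::floor_char_boundary`.
fn truncate_at_char_boundary(s: &str, max: usize) -> &str {
    if s.len() <= max {
        return s;
    }
    // Walk the index down until it lands on a character boundary.
    let mut idx = max;
    while !s.is_char_boundary(idx) {
        idx -= 1;
    }
    &s[..idx]
}

fn main() {
    let cjk = "你好世界"; // each char is 3 bytes
    // A naive `&cjk[..4]` would panic: byte 4 falls inside the second char.
    assert_eq!(truncate_at_char_boundary(cjk, 4), "你");
    println!("ok");
}
```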
Create `supervisor.rs` with centralized reconnection/backoff infrastructure
and refactor 23 channel adapters to use it, eliminating ~10,000 lines of
duplicated reconnection logic.

The supervisor module provides:
- `SupervisorConfig`: configurable initial/max backoff (default 1s/60s)
- `run_supervised_loop()`: generic supervised reconnection loop
- `run_supervised_loop_reset_on_connect()`: variant that resets backoff
  after successful connection
- `DEFAULT_CHANNEL_BUFFER` (256): shared constant replacing hardcoded sizes

Each adapter's inline reconnection loop is extracted into a standalone
`async fn` that returns `Result<bool, String>`:
- `Ok(true)` = reconnect (transient failure)
- `Ok(false)` = permanent stop (shutdown or channel closed)
- `Err(msg)` = retry with backoff
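
The three-way contract above can be sketched as a plain loop (synchronous here for brevity; the function name follows the commit message, but the body is illustrative rather than the actual module):

```rust
use std::time::Duration;

/// Supervise a connect function until it asks to stop.
/// Returns the number of attempts made (handy for testing the sketch).
fn run_supervised_loop<F>(initial: Duration, max: Duration, mut connect: F) -> u32
where
    F: FnMut() -> Result<bool, String>,
{
    let mut backoff = initial;
    let mut attempts = 0u32;
    loop {
        attempts += 1;
        match connect() {
            // Transient failure handled inside the adapter:
            // reconnect right away and reset the backoff.
            Ok(true) => backoff = initial,
            // Shutdown signal or closed channel: stop supervising.
            Ok(false) => return attempts,
            // Hard error: double the backoff (a real supervisor
            // would sleep for `backoff` before retrying).
            Err(_msg) => backoff = (backoff * 2).min(max),
        }
    }
}

fn main() {
    let mut calls = 0;
    let attempts = run_supervised_loop(
        Duration::from_secs(1),
        Duration::from_secs(60),
        || {
            calls += 1;
            match calls {
                1 | 2 => Err("connection refused".into()), // retried
                3 => Ok(true),                             // reconnect
                _ => Ok(false),                            // shutdown
            }
        },
    );
    assert_eq!(attempts, 4);
    println!("attempts: {attempts}");
}
```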

Refactored adapters: telegram, discord, slack, irc, mattermost, revolt,
matrix, mastodon, zulip, bluesky, twitch, guilded, gotify, gitter,
discourse, ntfy, nostr, webex, twist, mumble, nextcloud, reddit,
keybase, linkedin.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ules

Break the single routes.rs god file containing 174 handler functions
into domain-specific modules under routes/:

  agents.rs (1,551 lines, 24 handlers)
  channels.rs (1,220 lines, 7 handlers)
  hands.rs (668 lines, 11 handlers)
  models.rs (554 lines, 8 handlers)
  skills.rs (489 lines, 9 handlers)
  sessions.rs (384 lines, 10 handlers)
  common.rs (330 lines, shared types/helpers)
  + 22 smaller domain modules

Backward compatibility is maintained via `pub use` re-exports in
mod.rs, so server.rs and test files continue referencing
`routes::handler_name` without changes.
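
The compatibility trick can be sketched with inline modules standing in for the routes/ files (module and handler names here are made up for illustration):

```rust
mod routes {
    pub mod agents {
        pub fn list_agents() -> &'static str { "agents" }
    }
    pub mod channels {
        pub fn list_channels() -> &'static str { "channels" }
    }
    // Glob re-exports in mod.rs keep `routes::handler_name`
    // resolving after the split into domain modules.
    pub use agents::*;
    pub use channels::*;
}

fn main() {
    // Old call sites compile unchanged:
    assert_eq!(routes::list_agents(), "agents");
    assert_eq!(routes::list_channels(), "channels");
    println!("ok");
}
```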

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace 12 instances of `id.to_string()[..8]` with
`id.to_string().get(..8).unwrap_or(&id_str)` across kernel.rs (3) and
channel_bridge.rs (9).

While UUID Display always produces 36-char ASCII strings making [..8]
currently safe, .get() is defensive against any future Display changes
and is consistent with the floor_char_boundary() hardening in the rest
of the codebase.
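
A minimal sketch of the defensive short-ID pattern (helper name is made up):

```rust
/// First 8 bytes of an ID if that slice is valid UTF-8,
/// otherwise the whole string — never a panic.
fn short_id(id: &str) -> &str {
    id.get(..8).unwrap_or(id)
}

fn main() {
    // UUIDs are 36 ASCII chars, so the fast path just slices:
    assert_eq!(short_id("550e8400-e29b-41d4-a716-446655440000"), "550e8400");
    // If byte 8 ever landed mid-character, `.get()` returns None
    // and we fall back to the whole string instead of panicking:
    assert_eq!(short_id("你好世界abc"), "你好世界abc");
    println!("ok");
}
```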

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace hardcoded Whisper model names and API endpoints with LazyLock
statics that read from environment variables at first use:

- GROQ_STT_MODEL (default: whisper-large-v3-turbo)
- GROQ_STT_URL (default: api.groq.com/openai/v1/audio/transcriptions)
- OPENAI_STT_MODEL (default: whisper-1)
- OPENAI_STT_URL (default: api.openai.com/v1/audio/transcriptions)

This allows users to swap STT models or providers without recompiling.
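
A sketch of the pattern for one of the four variables, assuming the usual `std::env::var` + `LazyLock` shape (Rust 1.80+); the actual code may differ:

```rust
use std::sync::LazyLock;

// Read once at first use; fall back to the documented default
// when the environment variable is unset.
static GROQ_STT_MODEL: LazyLock<String> = LazyLock::new(|| {
    std::env::var("GROQ_STT_MODEL")
        .unwrap_or_else(|_| "whisper-large-v3-turbo".to_string())
});

fn main() {
    println!("GROQ_STT_MODEL = {}", *GROQ_STT_MODEL);
}
```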

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tool-heavy agent sessions (web_search + web_fetch) bloat to 100K+ tokens
because defaults are too generous: web_fetch returns 50K chars per call,
per-result budget cap allows 120K chars (30% of 200K context), and
auto-compaction only triggers at 30 messages. With 5+ tool iterations,
total context easily exceeds the model's window, causing 100+ second
response times and downstream timeouts.

Four targeted changes (defense in depth):

1. web_fetch max_chars: 50,000 → 12,000
   12K chars after HTML→markdown is still a full article. The previous
   50K default was far more than any LLM can usefully process per result.

2. Context budget ratios tightened:
   - per_result_cap: 30% → 8% (120K → 32K chars on 200K window)
   - single_result_max: 50% → 15% (200K → 60K chars)
   - total_headroom: 75% → 40% (300K → 160K chars)

3. MAX_HISTORY_MESSAGES: 20 → 12
   12 messages ≈ 3 tool iterations or 6 user/assistant exchanges.
   Previous value of 20 allowed 500K+ chars of tool results to accumulate.

4. Auto-compaction trigger: threshold 30 → 15, keep_recent 10 → 6
   Triggers LLM-based summarization earlier, preserving key context
   instead of blindly dropping messages when the window overflows.
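
The trigger and keep-recent split in change 4 can be sketched as follows (assumed shape; the real compactor replaces the older messages with an LLM-written summary):

```rust
/// New trigger: compact once the session reaches 15 messages (was 30).
fn should_compact(message_count: usize) -> bool {
    const THRESHOLD: usize = 15;
    message_count >= THRESHOLD
}

/// Split history into (to_summarize, keep_verbatim), keeping the
/// 6 most recent messages (was 10) untouched.
fn split_for_compaction<T>(messages: Vec<T>) -> (Vec<T>, Vec<T>) {
    const KEEP_RECENT: usize = 6;
    let cut = messages.len().saturating_sub(KEEP_RECENT);
    let mut older = messages;
    let recent = older.split_off(cut);
    (older, recent)
}

fn main() {
    assert!(!should_compact(14));
    assert!(should_compact(15));
    let (older, recent) = split_for_compaction((0..20).collect::<Vec<_>>());
    assert_eq!(older.len(), 14); // summarized into one message
    assert_eq!(recent.len(), 6); // kept verbatim
    println!("ok");
}
```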

Expected impact: estimated tokens per session drop from 75-125K to
10-15K, response time from 100+ seconds to 15-25 seconds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>