fix(opencode): reduce memory usage during prompting with lazy boundary scan and context windowing #18137
Open
BYK wants to merge 1 commit into anomalyco:dev
Conversation
Two optimizations to drastically reduce memory during prompting:

1. `filterCompactedLazy`: probe the newest 50 message infos (1 query, no parts) to detect compaction. If none found, fall back to the original single-pass `filterCompacted(stream())` — avoids 155+ wasted info-only queries for uncompacted sessions. Compacted sessions still use the efficient two-pass scan.
2. Context-window windowing: before calling `toModelMessages`, estimate which messages from the tail fit in the LLM context window using `model.limit.context * 4` chars/token, and only convert those messages to ModelMessage format. For a 7,704-message session where ~200 fit in context, this reduces `toModelMessages` input from 7,704 to ~200 messages — cutting ~300MB of wrapper objects across 4-5 copy layers down to ~10MB.

Also caches the conversation across prompt loop iterations — full reload only after compaction, incremental merge for tool-call steps.
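A rough TypeScript sketch of the probe-then-scan idea described above. All types and helper names here (`MessageInfo`, `hasRecentCompaction`, `findBoundary`) are illustrative stand-ins, not opencode's actual storage API:

```typescript
// Hypothetical info record: message metadata only, no parts loaded.
interface MessageInfo {
  id: string;
  summary?: boolean; // set on compaction-summary messages (assumed flag)
}

// Matches the PR's "probe newest 50 message infos" idea.
const PROBE_LIMIT = 50;

// Phase 0: cheap probe over the newest N infos (single query, no parts).
// If this returns false, the caller falls back to the original
// single-pass filterCompacted(stream()) and no extra queries are wasted.
function hasRecentCompaction(infos: MessageInfo[]): boolean {
  // infos assumed ordered oldest -> newest; check only the tail.
  return infos.slice(-PROBE_LIMIT).some((m) => m.summary === true);
}

// Phase 1 (compacted sessions only): info-only scan newest -> oldest
// to locate the most recent compaction boundary. Parts would then be
// hydrated only for messages at or after this index.
function findBoundary(infos: MessageInfo[]): number {
  for (let i = infos.length - 1; i >= 0; i--) {
    if (infos[i].summary) return i;
  }
  return 0; // no compaction found: the whole history is live
}
```

The point of the two phases is that the expensive step (loading parts) runs only for the small post-boundary slice, while uncompacted sessions pay just one extra metadata query.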
Contributor
Hey! Your PR title doesn't follow the required format. Please update it to start with one of the allowed prefixes. See CONTRIBUTING.md for details.
Issue for this PR
Closes #18136
Type of change
What does this PR do?
Three targeted optimizations to reduce peak RSS during prompting from ~4-8GB down to ~1.2GB:
**1. Lazy compaction boundary scan (`filterCompactedLazy`)**

The prompt loop calls `filterCompacted(stream(sessionID))`, which streams ALL messages newest→oldest, loading parts for every message. For compacted sessions, most of those parts are discarded once the boundary is found.

New approach: probe the newest 50 message infos (1 DB query, no parts). If a compaction summary is detected, use a two-phase scan: an info-only scan to find the boundary, then hydrate parts only for messages after it. If no compaction summary is found, fall back to the original single-pass `filterCompacted(stream())` to avoid wasted info-only queries.

**2. Context-window message windowing**

`toModelMessages` was called with ALL messages (e.g., 7,704 for a long session), creating ModelMessage wrapper objects for every one. These flow through 4-5 copy layers (`toModelMessages` → `convertToModelMessages` → `ProviderTransform.message` → `convertToLanguageModelPrompt`), each creating ~60MB of wrapper objects.

Now the prompt loop estimates which messages from the tail fit in the LLM context window (`model.limit.context` × 4 chars/token) and only passes those to `toModelMessages`. For a 7,704-message session where ~200 fit, this cuts the conversion pipeline from ~300MB to ~10MB.

**3. Prompt loop caching**

The conversation is loaded once before the loop. On normal tool-call iterations, only the latest 200-message page is fetched and merged into the cache. A full reload happens only after compaction.
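The windowing heuristic in item 2 can be sketched roughly like this (the `Msg` shape and function name are hypothetical; the real code operates on opencode's message objects and its actual `model.limit.context` value):

```typescript
// Mirrors the PR's estimate of ~4 characters per token.
const CHARS_PER_TOKEN = 4;

// Hypothetical minimal message shape for the sketch.
interface Msg {
  text: string;
}

// Walk the tail newest -> oldest, accumulating estimated character cost,
// and return only the suffix that fits the model's context budget.
// Everything older than `start` is never converted to ModelMessage form.
function windowForContext(messages: Msg[], contextTokens: number): Msg[] {
  const budgetChars = contextTokens * CHARS_PER_TOKEN;
  let used = 0;
  let start = messages.length;
  for (let i = messages.length - 1; i >= 0; i--) {
    used += messages[i].text.length;
    if (used > budgetChars) break;
    start = i;
  }
  // Never drop the newest message, even if it alone exceeds the estimate.
  if (start === messages.length && messages.length > 0) {
    start = messages.length - 1;
  }
  return messages.slice(start);
}
```

Because the estimate runs on raw text lengths before any conversion, the 4-5 downstream copy layers only ever see the ~200 messages that can actually fit, not the full 7,704-message history.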
How did you verify your code works?
Sampled `/proc/<PID>/status` every 30s during active prompting.

Screenshots / recordings
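A 30-second RSS sampler along these lines takes only a few lines of Node/TypeScript (illustrative, not the exact tooling used for this PR; Linux-only, since it reads procfs and parses the `VmRSS` field):

```typescript
import { readFileSync } from "node:fs";

// Extract VmRSS (resident set size, in kB) from /proc/<pid>/status text.
function parseVmRssKb(statusText: string): number | undefined {
  const m = statusText.match(/^VmRSS:\s+(\d+)\s+kB/m);
  return m ? Number(m[1]) : undefined;
}

// Log the current process's RSS every `intervalMs` (default 30s).
function startSampler(intervalMs = 30_000): ReturnType<typeof setInterval> {
  return setInterval(() => {
    const rss = parseVmRssKb(readFileSync("/proc/self/status", "utf8"));
    console.log(new Date().toISOString(), "VmRSS kB:", rss);
  }, intervalMs);
}
```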
Memory monitoring (30s intervals) after fix:
Checklist