Prefix caching | Mamba memory only. #3657
Open
lmcafee-nvidia wants to merge 1 commit into NVIDIA:main from
Conversation
santhnm2 approved these changes on Mar 3, 2026.
kvareddy approved these changes on Mar 3, 2026.
…ring Hybrid models (Transformer + Mamba) lack per-block Mamba states, so prefix computation cannot be skipped. This adds a guard in `_compute_prefix_match` that forces `prefix_skip_tokens = 0` when `is_hybrid_model` is `True`, ensuring all tokens are recomputed while still sharing KV blocks for memory savings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed from e6fd0b4 to cc45bb3.
Summary

- Adds a guard in `_compute_prefix_match` that forces `prefix_skip_tokens = 0` when `is_hybrid_model` is `True`, so matched prefix blocks are still shared (saving memory) but all prompt tokens are still processed through the model (preserving Mamba state correctness).
- Adds `TestHybridModelMemoryOnly`, verifying: no prefill skipping, block reuse for memory savings, correct ref counts for shared blocks, and all prompt tokens present in context.

Details
When prefix caching is enabled for a hybrid model, the system operates in "memory-only" mode: matched prefix blocks are shared for memory savings, but no prompt tokens are skipped during prefill.
The change is a single 3-line guard in `_compute_prefix_match` (~line 1624 of `dynamic_context.py`).

Benchmarked on a 2B hybrid model (23 Mamba + 4 Attention + 23 MLP layers, 50 total) with 10 identical requests (644 tokens each):
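A minimal standalone sketch of the guard's logic (the function name, arguments, and return shape here are simplified stand-ins; the real `_compute_prefix_match` in `dynamic_context.py` carries additional bookkeeping not reproduced here):

```python
def compute_prefix_match(matched_block_count: int, block_size: int,
                         is_hybrid_model: bool) -> tuple[int, int]:
    """Return (prefix_skip_tokens, shared_block_count) for a new request.

    Hybrid (Transformer + Mamba) models have no per-block Mamba state,
    so matched blocks may be *shared* for memory savings, but no prompt
    tokens may be *skipped* during prefill.
    """
    prefix_skip_tokens = matched_block_count * block_size

    # Memory-only mode: keep sharing the matched KV blocks, but force a
    # full recompute so the Mamba recurrent state is rebuilt from scratch.
    if is_hybrid_model:
        prefix_skip_tokens = 0

    return prefix_skip_tokens, matched_block_count
```

With `is_hybrid_model=False` the matched tokens are skipped as usual; with `True` the same blocks are still counted as shared while `prefix_skip_tokens` is pinned to 0, which is the behavior the PR describes.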
Test plan
- `test_no_prefill_skipping_for_hybrid_model`: verifies `prefix_skip_tokens == 0` and `effective_chunk_length == chunk_length` even when blocks match
- `test_matched_blocks_reused_saving_memory`: verifies the second request consumes no additional blocks from the pool
- `test_ref_counts_incremented_for_matched_blocks`: verifies matched blocks have `ref_count == 2` after sharing
- `test_all_prompt_tokens_in_context`: verifies all prompt tokens are active (none skipped) and `kv_length_offset == 0`

🤖 Generated with Claude Code
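The block-sharing semantics the middle two tests exercise can be illustrated with a toy ref-counted pool (a hypothetical class written for this sketch; the real KV-block allocator in the codebase differs):

```python
class ToyBlockPool:
    """Toy ref-counted KV-block pool illustrating prefix-match sharing."""

    def __init__(self, num_blocks: int):
        self.ref_counts: dict[int, int] = {}
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        """First request: take a fresh block from the free pool."""
        block = self.free_blocks.pop()
        self.ref_counts[block] = 1
        return block

    def share(self, block: int) -> int:
        """Prefix match: reuse an existing block, consuming no new ones."""
        self.ref_counts[block] += 1
        return block


pool = ToyBlockPool(num_blocks=8)
b = pool.allocate()           # request 1 fills the block
pool.share(b)                 # request 2 matches the prefix and shares it
print(pool.ref_counts[b])     # 2: ref_count == 2 after sharing
print(len(pool.free_blocks))  # 7: the second request took no extra block
```

In memory-only mode this sharing is the entire benefit: the second request's blocks are deduplicated exactly as above, while its prompt tokens are still recomputed in full.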