fix: KV-Router: degrade to empty overlaps when indexer offline #4428

vladnosiv · 2025-11-18T11:32:47Z

Overview:

Currently, when the indexer is unavailable (for example, because of the bug in #4394), the frontend cannot process any requests and fails all of them with status 500.

Because choosing the optimal instance is a best-effort operation, an indexer outage should not cause requests to fail.

Details:

As part of this PR, if the indexer is unavailable, the behavior degrades to the case when nothing is cached.
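To illustrate why empty overlaps are a safe fallback, here is a minimal sketch of a worker selector. It is not the project's scheduler: `select_worker`, the `loads` map, and the "overlap minus load" score are hypothetical, and `OverlapScores` is reduced to a single map. The point is only that a selector which treats a missing overlap as zero still picks a worker when the indexer returns nothing.

```rust
use std::collections::HashMap;

/// Simplified stand-in for the router's overlap result (assumption):
/// worker id -> number of cached blocks matching the request prefix.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct OverlapScores {
    pub scores: HashMap<u64, u32>,
}

/// Pick the worker with the best "overlap minus load" score.
/// `loads` maps worker id -> active blocks (hypothetical metric).
/// Returns `None` only when no workers are known at all.
pub fn select_worker(overlaps: &OverlapScores, loads: &HashMap<u64, u32>) -> Option<u64> {
    loads
        .iter()
        .map(|(&worker, &load)| {
            // A worker absent from the overlap map counts as "nothing cached".
            let overlap = overlaps.scores.get(&worker).copied().unwrap_or(0);
            (i64::from(overlap) - i64::from(load), worker)
        })
        // Highest score wins; ties go to the smaller worker id.
        .max_by(|a, b| a.0.cmp(&b.0).then(b.1.cmp(&a.1)))
        .map(|(_, worker)| worker)
}
```

With an empty `OverlapScores`, the choice in this sketch is driven by load alone, which is the same outcome as a cold cache.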

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Partially Resolves: #4394

Summary by CodeRabbit

Bug Fixes
- Improved system stability by gracefully handling offline indexer components; empty results now returned instead of errors when indexer becomes unavailable.
Tests
- Added test coverage to verify consistent error handling when indexer components are offline or disconnected.

Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>

copy-pr-bot · 2025-11-18T11:32:53Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions · 2025-11-18T11:32:56Z

👋 Hi vladnosiv! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test Github Validation CI runs an essential subset of the testing framework to quickly catch errors. Your PR reviewers may elect to test the changes comprehensively before approving your changes.

🚀

coderabbitai · 2025-11-18T11:37:12Z

Walkthrough

Changes to the KV router indexer error handling replace panic-prone operations and error propagation with graceful degradation—returning empty results with warning logs when the indexer is offline or requests fail. Tests added to verify empty OverlapScores are returned when offline.

Changes

Cohort / File(s)	Summary
KV Indexer error handling & tests `lib/llm/src/kv_router/indexer.rs`	Updated error handling to log warnings and return empty OverlapScores instead of propagating errors when match requests fail, responses fail to await, or broadcast operations fail. Added tests verifying KvIndexer and KvIndexerSharded return empty OverlapScores after shutdown (offline state).

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Verify the error handling change from panic/error propagation to graceful empty-result returns aligns with system reliability goals
Confirm test coverage adequately validates offline behavior for both single and sharded indexer variants
Review warning log messages for clarity and debuggability

Poem

A rabbit once crashed when blocks went awry,
With RefCells borrowed and panics so nigh.
But warnings and grace saved the day with a bound,
Now empty results, not crashes, abound! 🐰✨

Pre-merge checks

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 71.43% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately and concisely summarizes the main change: KV-Router degrades gracefully by returning empty overlaps when the indexer is offline instead of propagating errors.
Description check	✅ Passed	The description follows the template with all required sections (Overview, Details, Related Issues) and provides clear context about the problem, solution, and related issue.
Linked Issues check	✅ Passed	The PR partially addresses issue #4394 by implementing graceful degradation when the indexer is offline (falling back to empty overlaps), which directly supports the objective of maintaining frontend availability even when KV routing is unavailable.
Out of Scope Changes check	✅ Passed	All changes are scoped to the indexer offline behavior: error handling modifications, control flow adjustments for broadcast failures, and added tests for offline scenarios. These are directly related to the PR objective and linked issue requirements.


coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fc2ad4e and ac027b9.

📒 Files selected for processing (1)

lib/llm/src/kv_router/indexer.rs (3 hunks)

🧰 Additional context used

🧠 Learnings (8)

📓 Common learnings

Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 2840
File: lib/llm/src/kv_router/sequence.rs:86-88
Timestamp: 2025-09-03T19:31:32.621Z
Learning: PeaBrane chose to defer fixing the corner case where a single late-arriving request might never expire in the ActiveSequences expiry mechanism (lib/llm/src/kv_router/sequence.rs). They prefer to avoid adding a background loop for periodic cleanup at this time, accepting the technical debt to keep the current PR scope contained.

Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 3095
File: lib/llm/src/kv_router/indexer.rs:0-0
Timestamp: 2025-09-17T20:55:06.333Z
Learning: When PeaBrane encounters a complex implementation issue that would significantly expand PR scope (like the remove_worker_sender method in lib/llm/src/kv_router/indexer.rs that required thread-safe map updates and proper shard targeting), they prefer to remove the problematic implementation entirely rather than rush a partial fix, deferring the proper solution to a future PR.

Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 3077
File: lib/llm/src/kv_router/subscriber.rs:334-336
Timestamp: 2025-09-17T01:00:50.937Z
Learning: PeaBrane identified that reordering tokio::select! arms in the indexer (moving dump_rx.recv() to be after event_rx.recv()) creates a natural barrier that ensures RouterEvents are always processed before dump requests, solving the ack-before-commit race condition. This leverages the existing biased directive and requires minimal code changes, aligning with their preference for contained solutions.

Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 1285
File: lib/llm/src/kv_router/scheduler.rs:260-266
Timestamp: 2025-05-30T06:34:12.785Z
Learning: In the KV router scheduler code, PeaBrane prefers fail-fast behavior over silent failure handling. When accessing worker metrics data that could be out-of-bounds (like dp_rank indexing), explicit panics are preferred over graceful degradation with continue statements to ensure data integrity issues are caught early.

Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.

Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 2756
File: tests/router/test_router_e2e_with_mockers.py:961-974
Timestamp: 2025-08-29T09:53:45.266Z
Learning: Indexer dumps in the KV router system are designed to never contain remove or clear events - they only contain "stored" events. Therefore, code that processes indexer dump events can safely assume the presence of event["event"]["data"]["stored"] structure without additional error handling for other event types.

Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 2756
File: lib/bindings/python/rust/llm/kv.rs:401-436
Timestamp: 2025-08-29T10:08:18.434Z
Learning: In the Python KvIndexer bindings (lib/bindings/python/rust/llm/kv.rs), the hardcoded reset_states=true parameter passed to start_kv_router_background is intentional behavior, not an oversight that needs to be made configurable.

Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.

📚 Learning: 2025-05-30T06:38:09.630Z

Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.

Applied to files:

lib/llm/src/kv_router/indexer.rs

📚 Learning: 2025-06-05T01:02:15.318Z

Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.

Applied to files:

lib/llm/src/kv_router/indexer.rs

📚 Learning: 2025-10-14T00:58:05.744Z

Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 3597
File: lib/llm/src/kv_router/indexer.rs:437-441
Timestamp: 2025-10-14T00:58:05.744Z
Learning: In lib/llm/src/kv_router/indexer.rs, when a KvCacheEventData::Cleared event is received, the system intentionally clears all dp_ranks for the given worker_id by calling clear_all_blocks(worker.worker_id). This is the desired behavior and should not be scoped to individual dp_ranks.

Applied to files:

lib/llm/src/kv_router/indexer.rs

📚 Learning: 2025-09-17T01:00:50.937Z

Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 3077
File: lib/llm/src/kv_router/subscriber.rs:334-336
Timestamp: 2025-09-17T01:00:50.937Z
Learning: PeaBrane identified that reordering tokio::select! arms in the indexer (moving dump_rx.recv() to be after event_rx.recv()) creates a natural barrier that ensures RouterEvents are always processed before dump requests, solving the ack-before-commit race condition. This leverages the existing biased directive and requires minimal code changes, aligning with their preference for contained solutions.

Applied to files:

lib/llm/src/kv_router/indexer.rs

📚 Learning: 2025-05-29T00:02:35.018Z

Learnt from: alec-flowers
Repo: ai-dynamo/dynamo PR: 1181
File: lib/llm/src/kv_router/publisher.rs:379-425
Timestamp: 2025-05-29T00:02:35.018Z
Learning: In lib/llm/src/kv_router/publisher.rs, the functions `create_stored_blocks` and `create_stored_block_from_parts` are correctly implemented and not problematic duplications of existing functionality elsewhere in the codebase.

Applied to files:

lib/llm/src/kv_router/indexer.rs

📚 Learning: 2025-09-17T20:55:06.333Z

Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 3095
File: lib/llm/src/kv_router/indexer.rs:0-0
Timestamp: 2025-09-17T20:55:06.333Z
Learning: When PeaBrane encounters a complex implementation issue that would significantly expand PR scope (like the remove_worker_sender method in lib/llm/src/kv_router/indexer.rs that required thread-safe map updates and proper shard targeting), they prefer to remove the problematic implementation entirely rather than rush a partial fix, deferring the proper solution to a future PR.

Applied to files:

lib/llm/src/kv_router/indexer.rs

📚 Learning: 2025-09-11T03:24:47.820Z

Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 3004
File: lib/runtime/src/pipeline/network/ingress/push_handler.rs:271-277
Timestamp: 2025-09-11T03:24:47.820Z
Learning: In lib/runtime/src/pipeline/network/ingress/push_handler.rs, the maintainer prefers to keep the existing error comparison logic using format!("{:?}", err) == STREAM_ERR_MSG unchanged until proper error types are implemented, even though it has technical debt. Avoid suggesting changes to working legacy code that will be refactored later.

Applied to files:

lib/llm/src/kv_router/indexer.rs

🧬 Code graph analysis (1)

lib/llm/src/kv_router/indexer.rs (3)

lib/bindings/python/src/dynamo/_core.pyi (8): OverlapScores (483-508), scores (490-497), metrics (88-95), metrics (117-124), metrics (160-167), CancellationToken (60-71), KvIndexer (583-619), frequencies (500-508)

lib/llm/src/kv_router.rs (3): new (134-155), new (221-314), new (486-491)

lib/llm/src/kv_router/approx.rs (3): new (124-132), new (268-435), sequence (469-473)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)

GitHub Check: Build and Test - dynamo
GitHub Check: tests (lib/bindings/python)
GitHub Check: tests (lib/runtime/examples)
GitHub Check: clippy (lib/bindings/python)
GitHub Check: clippy (launch/dynamo-run)
GitHub Check: tests (launch/dynamo-run)
GitHub Check: tests (.)
GitHub Check: clippy (.)

🔇 Additional comments (4)

lib/llm/src/kv_router/indexer.rs (4)

996-1002: LGTM! Graceful degradation aligns with best-effort semantics.

The change correctly handles indexer unavailability by returning empty overlap scores with a warning instead of propagating an error. This aligns with the PR objective that choosing an optimal instance is a best-effort operation and should not cause frontend failures.

1004-1013: LGTM! Consistent graceful degradation for response failures.

The receive failure handling is consistent with the send failure handling above. Both cases now degrade gracefully by returning empty scores when the indexer is offline, preventing cascading failures to the frontend.
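The two failure points discussed above (sending the match request and awaiting the response) can be sketched as follows. This is an assumption-laden illustration, not the code in `indexer.rs`: it uses `std::sync::mpsc` instead of the async channels a real router would use, `MatchRequest` and `find_matches` are hypothetical names, and `eprintln!` stands in for the project's logging.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Sender};

/// Simplified stand-in for the overlap result (assumption).
#[derive(Debug, Default, PartialEq)]
pub struct OverlapScores {
    pub scores: HashMap<u64, u32>,
}

/// Hypothetical request: block hashes plus a channel for the reply.
pub struct MatchRequest {
    pub block_hashes: Vec<u64>,
    pub reply: Sender<OverlapScores>,
}

/// Query the indexer; degrade to empty scores if it is unreachable.
pub fn find_matches(indexer: &Sender<MatchRequest>, block_hashes: Vec<u64>) -> OverlapScores {
    let (reply_tx, reply_rx) = channel();
    let request = MatchRequest { block_hashes, reply: reply_tx };

    // Failure point 1: the indexer's receiver is gone, so the send fails.
    if let Err(err) = indexer.send(request) {
        eprintln!("indexer offline, failed to send match request: {err}");
        return OverlapScores::default();
    }

    // Failure point 2: the indexer dropped the reply sender without answering.
    match reply_rx.recv() {
        Ok(scores) => scores,
        Err(err) => {
            eprintln!("indexer offline, failed to receive match response: {err}");
            OverlapScores::default()
        }
    }
}
```

Both branches return the same value as a query against an empty index, so callers in this sketch need no special offline path.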

1234-1244: LGTM! Sharded implementation consistent with non-sharded version.

The broadcast failure handling is correctly applied to the sharded indexer, maintaining consistency with the non-sharded KvIndexer implementation. When the broadcast fails (all shards offline or channel closed), returning empty scores immediately is the right approach.
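A rough sketch of the sharded case, under stated assumptions: shards are modeled as a slice of `std::sync::mpsc` senders rather than a real broadcast channel, `ShardRequest` and `find_matches_sharded` are hypothetical names, and treating a shard that accepts a request but never replies as offline is a choice made for this sketch, not something the PR text confirms.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Sender};

/// Simplified stand-in for the overlap result (assumption).
#[derive(Debug, Default, PartialEq)]
pub struct OverlapScores {
    pub scores: HashMap<u64, u32>,
}

/// Hypothetical per-shard request.
pub struct ShardRequest {
    pub block_hashes: Vec<u64>,
    pub reply: Sender<OverlapScores>,
}

/// Send the query to every shard and merge the partial results.
/// Any failed send returns empty scores immediately.
pub fn find_matches_sharded(
    shards: &[Sender<ShardRequest>],
    block_hashes: &[u64],
) -> OverlapScores {
    let (reply_tx, reply_rx) = channel();
    for shard in shards {
        let request = ShardRequest {
            block_hashes: block_hashes.to_vec(),
            reply: reply_tx.clone(),
        };
        if let Err(err) = shard.send(request) {
            eprintln!("indexer shard offline, failed to broadcast match request: {err}");
            return OverlapScores::default();
        }
    }
    // Drop our own sender so the loop below ends once all shards are done.
    drop(reply_tx);

    let mut merged = OverlapScores::default();
    let mut replies = 0;
    for partial in reply_rx {
        replies += 1;
        for (worker, count) in partial.scores {
            *merged.scores.entry(worker).or_insert(0) += count;
        }
    }
    // Sketch-only policy: a missing reply also degrades to empty scores.
    if replies != shards.len() {
        return OverlapScores::default();
    }
    merged
}
```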

2159-2187: LGTM! Good test coverage for the new behavior.

The tests correctly verify that both KvIndexer and KvIndexerSharded return empty OverlapScores (with empty scores, frequencies, and tree_sizes) when the indexer is offline. This provides good coverage of the graceful degradation behavior.
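The shape of such an offline test can be sketched with a toy indexer. Everything here is hypothetical (`ToyIndexer`, `Message`, the field types of `OverlapScores`, and the thread-based worker); the actual tests exercise `KvIndexer` and `KvIndexerSharded`. The sketch shuts the worker down and then checks that a query yields empty `scores`, `frequencies`, and `tree_sizes`.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Sender};
use std::thread::{self, JoinHandle};

/// Field names follow the review comment; the types are assumptions.
#[derive(Debug, Default, PartialEq)]
pub struct OverlapScores {
    pub scores: HashMap<u64, u32>,
    pub frequencies: Vec<usize>,
    pub tree_sizes: HashMap<u64, usize>,
}

enum Message {
    Match(Vec<u64>, Sender<OverlapScores>),
    Shutdown,
}

/// Toy indexer backed by one worker thread (hypothetical).
pub struct ToyIndexer {
    tx: Sender<Message>,
    worker: Option<JoinHandle<()>>,
}

impl ToyIndexer {
    pub fn new() -> Self {
        let (tx, rx) = channel();
        let worker = thread::spawn(move || {
            // Serve match requests until Shutdown arrives or all senders drop.
            while let Ok(Message::Match(hashes, reply)) = rx.recv() {
                let mut result = OverlapScores::default();
                result.scores.insert(0, hashes.len() as u32);
                let _ = reply.send(result);
            }
            // `rx` is dropped here, so later sends fail.
        });
        Self { tx, worker: Some(worker) }
    }

    /// Stop the worker and wait for it, leaving the indexer offline.
    pub fn shutdown(&mut self) {
        let _ = self.tx.send(Message::Shutdown);
        if let Some(worker) = self.worker.take() {
            let _ = worker.join();
        }
    }

    /// Returns empty scores when the worker is gone.
    pub fn find_matches(&self, hashes: Vec<u64>) -> OverlapScores {
        let (reply_tx, reply_rx) = channel();
        if self.tx.send(Message::Match(hashes, reply_tx)).is_err() {
            return OverlapScores::default();
        }
        reply_rx.recv().unwrap_or_default()
    }
}
```

A test against this toy would call `shutdown()` and then assert that all three fields of the returned `OverlapScores` are empty.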


grahamking · 2025-11-18T16:23:05Z

@PeaBrane Could you take a look?

grahamking · 2025-11-18T16:23:56Z

/ok to test 2c4c8d8


degrade to empty overlaps when indexer offline

ac027b9

Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>

vladnosiv requested a review from a team as a code owner November 18, 2025 11:32

pull-request-size bot added the size/M label Nov 18, 2025

github-actions bot added external-contribution Pull request is from an external contributor fix labels Nov 18, 2025

vladnosiv mentioned this pull request Nov 18, 2025

[BUG]: KV-Router indexer receives invalid self-referential blocks #4394

Closed

coderabbitai bot reviewed Nov 18, 2025

lib/llm/src/kv_router/indexer.rs (resolved)

add symmetric code for approx indexer

2c4c8d8

Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>

pull-request-size bot added size/L and removed size/M labels Nov 18, 2025

grahamking requested a review from PeaBrane November 18, 2025 16:23

copy-pr-bot bot temporarily deployed to GITLAB November 18, 2025 16:24 Inactive

copy-pr-bot bot temporarily deployed to GITLAB November 18, 2025 16:28 Inactive

PeaBrane reviewed Nov 18, 2025

lib/llm/src/kv_router/approx.rs (outdated, resolved)

PeaBrane reviewed Nov 18, 2025

lib/llm/src/kv_router/indexer.rs (outdated, resolved)

PeaBrane reviewed Nov 18, 2025

lib/llm/src/kv_router/indexer.rs (resolved)

vladnosiv added 2 commits November 19, 2025 12:32

error level

822a5c2

Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>

add panic tests

3a050e7

Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>

vladnosiv requested a review from a team as a code owner November 19, 2025 12:46

vladnosiv added 3 commits November 19, 2025 15:54

fix fmt

f8dd307

Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>

fix clippy

2a01ab9

Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>

revert panic tests

27fe75c

Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>

vladnosiv force-pushed the indexer-is-best-effort branch from bd86ea3 to 27fe75c Compare November 20, 2025 13:15


3 participants
