
feat(observability): add logging request level statistics from engine #757

Open
scottjlee wants to merge 42 commits into main from sjl/0302-unify-request-stats

Conversation


@scottjlee scottjlee commented Mar 13, 2026

Description

Problem

Currently, smg does not log or record performance statistics for individual requests (timings, token counts, cache hit rate, etc.). It is useful to have these results available at the request level for performance analysis and debugging, as well as for production monitoring.

Solution

  • Adds a unified, engine-agnostic mechanism for fetching and emitting per-request statistics from inference engines, with a concrete SGLang implementation for chat. Other engines can be supported in a future PR.
  • This feature is disabled by default, and can be toggled on with the --enable-request-statistics flag.

Example of request statistics:

2026-03-13 20:38:56 DEBUG smg::observability::events: model_gateway/src/observability/events.rs:207: request_stats request_id=chatcmpl-019ce8ec-6b83-7a80-b90e-6f54e1a4a10d model=meta-llama/Llama-3.1-8B-Instruct router_backend=regular engine=sglang request_received_timestamp_s=1773434334.2372427 first_token_generated_timestamp_s=1773434334.252578 request_finished_timestamp_s=1773434336.2970366 response_sent_timestamp_s=1773434334.2526355 cache_hit_rate=0.9761904761904762 prompt_tokens=42 completion_tokens=908 cached_tokens=41

Changes

  • Add a new GetRequestStats RPC and RequestStats message to the SGLang scheduler proto, which carries request statistics (timestamps, token counts, cache hit rate, spec decoding acceptance rate).
  • Add a new UnifiedRequestStats struct representing engine-neutral statistics for each request; it unifies the request statistics from different engines into a single common format, making them easy for downstream consumers to parse (e.g. ingestion into a data pipeline or analytics platform). A sketch of this struct follows the list.
  • Add a new StatsProtoStream struct which wraps the existing ProtoStream and additionally tracks the request metadata needed for stats fetching. It is passed to RequestExecutionStage exactly as before, with the optional stats-fetching metadata added.
  • Add a new config option that enables this feature via the --enable-request-statistics flag.
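
As mentioned above, a minimal sketch of what the engine-neutral struct and its per-sample merge could look like is below. The field names mirror the logged example and the opt_min/opt_max helpers mentioned in the changelog, but the exact definition in this PR may differ; in particular, how prompt/cached token counts are combined across samples is an assumption here.

/// Sketch only: engine-neutral per-request statistics (field names follow the log example).
#[derive(Debug, Default, Clone)]
pub struct UnifiedRequestStats {
    pub request_received_timestamp_s: Option<f64>,
    pub first_token_generated_timestamp_s: Option<f64>,
    pub request_finished_timestamp_s: Option<f64>,
    pub response_sent_timestamp_s: Option<f64>,
    pub cache_hit_rate: Option<f64>,
    pub spec_decoding_acceptance_rate: Option<f64>,
    pub prompt_tokens: u64,
    pub completion_tokens: u64,
    pub cached_tokens: u64,
}

fn opt_min(a: Option<f64>, b: Option<f64>) -> Option<f64> {
    match (a, b) {
        (Some(x), Some(y)) => Some(x.min(y)),
        (x, y) => x.or(y),
    }
}

fn opt_max(a: Option<f64>, b: Option<f64>) -> Option<f64> {
    match (a, b) {
        (Some(x), Some(y)) => Some(x.max(y)),
        (x, y) => x.or(y),
    }
}

impl UnifiedRequestStats {
    /// Merge stats from another sample of the same request (n > 1):
    /// earliest start, latest finish, completion tokens summed across samples.
    pub fn merge(&mut self, other: &UnifiedRequestStats) {
        self.request_received_timestamp_s =
            opt_min(self.request_received_timestamp_s, other.request_received_timestamp_s);
        self.first_token_generated_timestamp_s =
            opt_min(self.first_token_generated_timestamp_s, other.first_token_generated_timestamp_s);
        self.request_finished_timestamp_s =
            opt_max(self.request_finished_timestamp_s, other.request_finished_timestamp_s);
        self.response_sent_timestamp_s =
            opt_max(self.response_sent_timestamp_s, other.response_sent_timestamp_s);
        // The prompt is shared across samples, so take the max instead of summing
        // (one plausible choice); completion tokens are summed.
        self.prompt_tokens = self.prompt_tokens.max(other.prompt_tokens);
        self.cached_tokens = self.cached_tokens.max(other.cached_tokens);
        self.completion_tokens += other.completion_tokens;
        // Rates: keep the first available value (mirrors the PR's get_or_insert approach).
        if let Some(rate) = other.cache_hit_rate {
            self.cache_hit_rate.get_or_insert(rate);
        }
        if let Some(rate) = other.spec_decoding_acceptance_rate {
            self.spec_decoding_acceptance_rate.get_or_insert(rate);
        }
    }
}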

The flow of how request stats are fetched (a rough sketch of the emission task follows this list):

  • Request completes
  • StatsProtoStream::spawn_stats_emission() is called in StreamingProcessor (streaming case) or collect_responses() (non-streaming case). If enable_request_statistics is set, it calls ProtoStream::spawn_stats_emission(), which matches on the engine type; otherwise, stats collection is skipped.
  • For the matched engine type (currently only SGLang), spawn a fire-and-forget tokio::spawn task which calls SglangSchedulerClient::get_request_stats() via the GetRequestStats RPC.
  • Collect one RequestStats per sample in the request, merge them together, and produce the standardized UnifiedRequestStats.
  • Emit a RequestStatsEvent with the resulting UnifiedRequestStats.
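
A rough sketch of that emission task is below, written as a free function for illustration; in the PR this logic lives on StatsProtoStream/ProtoStream, and the exact signatures of get_request_stats(), unified_stats_from_sglang(), and the emit_request_stats! macro are assumptions here.

// Sketch only: fire-and-forget stats emission after a request completes.
fn spawn_stats_emission(
    enabled: bool,                 // --enable-request-statistics
    client: SglangSchedulerClient, // held by the stream wrapper in the PR
    request_id: String,
) {
    // Disabled by default: skip unless the flag was set.
    if !enabled {
        return;
    }
    // Fire-and-forget: stats fetching never blocks the response path.
    tokio::spawn(async move {
        match client.get_request_stats(&request_id).await {
            Ok(per_sample) => {
                // One RequestStats per sample; fold them into the unified format.
                let mut unified = UnifiedRequestStats::default();
                for stats in &per_sample {
                    unified.merge(&unified_stats_from_sglang(stats));
                }
                // The PR emits a RequestStatsEvent via emit_request_stats!;
                // a plain tracing line stands in for it here.
                tracing::debug!(%request_id, ?unified, "request_stats");
            }
            Err(err) => {
                tracing::debug!(%request_id, error = %err, "failed to fetch request stats");
            }
        }
    });
}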

Test Plan

Tested locally with meta-llama/Llama-3.1-8B-Instruct and SGLang. Sample request with n=3 samples:

curl -N http://127.0.0.1:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role":"user","content":"Count 1 to 100 quickly."}],
    "stream": true,
    "max_tokens": 1024, "n": 3
  }'

Resulting logged request statistics:

2026-03-13 20:38:56 DEBUG smg::observability::events: model_gateway/src/observability/events.rs:207: request_stats request_id=chatcmpl-019ce8ec-6b83-7a80-b90e-6f54e1a4a10d model=meta-llama/Llama-3.1-8B-Instruct router_backend=regular engine=sglang request_received_timestamp_s=1773434334.2372427 first_token_generated_timestamp_s=1773434334.252578 request_finished_timestamp_s=1773434336.2970366 response_sent_timestamp_s=1773434334.2526355 cache_hit_rate=0.9761904761904762 prompt_tokens=42 completion_tokens=908 cached_tokens=41
Checklist
  • cargo +nightly fmt passes
  • cargo clippy --all-targets --all-features -- -D warnings passes
  • (Optional) Documentation updated
  • (Optional) Please join us on Slack #sig-smg to discuss, review, and merge PRs

Summary by CodeRabbit

  • New Features
    • Per-request statistics (timestamps, token counts, cache hit rate, speculative-decoding rate) and retrieval by request ID.
    • New RPC to query request metrics.
    • Config flag and CLI option to enable request-level statistics.
    • Automatic emission of aggregated request stats into observability/metrics, including across streaming responses and backend-labeled pipelines.

scottjlee and others added 30 commits March 4, 2026 17:07
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant observability enhancement by enabling the collection and logging of detailed performance statistics for individual inference requests. It establishes a foundational framework that is engine-agnostic, with initial support for SGLang, to capture critical metrics such as request timings, token counts, and cache hit rates. This data is crucial for performance analysis, debugging, and production monitoring. The feature is designed to be opt-in, controlled by a new command-line flag, ensuring it only impacts users who explicitly enable it.

Highlights

  • Request Statistics Collection: Implemented a unified, engine-agnostic mechanism to fetch and emit per-request performance statistics from inference engines, starting with SGLang.
  • SGLang gRPC API Extension: Added a new GetRequestStats RPC and RequestStats message to the SGLang scheduler proto to retrieve detailed request-level metrics.
  • Unified Data Structure: Introduced UnifiedRequestStats to standardize performance metrics across different engines, facilitating downstream analysis and monitoring.
  • Feature Toggle: Added a new configuration flag, --enable-request-statistics, allowing users to enable or disable this observability feature.
  • Stream Wrapper for Stats: Created StatsProtoStream to wrap existing ProtoStream instances, enabling the collection and emission of request statistics upon stream completion.
Changelog
  • crates/grpc_client/proto/sglang_scheduler.proto
    • Added GetRequestStats RPC to the SglangScheduler service.
    • Defined GetRequestStatsRequest and RequestStats messages for per-request statistics.
    • Defined GetRequestStatsResponse to return a list of RequestStats.
  • crates/grpc_client/src/sglang_scheduler.rs
    • Added request_id and scheduler_client accessors to AbortOnDropStream.
    • Implemented get_request_stats method in SglangSchedulerClient to call the new RPC.
  • model_gateway/src/config/builder.rs
    • Added enable_request_statistics builder method to RouterConfigBuilder.
  • model_gateway/src/config/types.rs
    • Added enable_request_statistics field to RouterConfig struct, defaulting to false.
  • model_gateway/src/main.rs
    • Added --enable-request-statistics CLI argument.
    • Mapped the new CLI argument to the RouterConfigBuilder.
  • model_gateway/src/observability/events.rs
    • Introduced UnifiedRequestStats struct for engine-agnostic request statistics.
    • Added maybe_emit_event and merge methods to UnifiedRequestStats.
    • Defined opt_min and opt_max helper functions for merging optional f64 values.
    • Created RequestStatsEvent struct and emit_request_stats! macro for logging.
    • Added unit tests for UnifiedRequestStats merging and event emission.
  • model_gateway/src/routers/grpc/common/response_collection.rs
    • Updated imports to use StatsProtoStream.
    • Modified collect_responses to call spawn_stats_emission on StatsProtoStream after completion.
    • Changed collect_stream_responses to accept StatsProtoStream.
  • model_gateway/src/routers/grpc/common/stages/request_execution.rs
    • Imported StatsProtoStream.
    • Added router_backend field to RequestExecutionStage.
    • Modified RequestExecutionStage::new to accept router_backend.
    • Updated execute_single, execute_dual_dispatch, and execute_sequential_pd to return ProtoStream directly, then wrap them in StatsProtoStream within the apply method.
    • Wrapped ProtoStream instances with StatsProtoStream before returning ExecutionResult.
  • model_gateway/src/routers/grpc/context.rs
    • Updated imports to use StatsProtoStream.
    • Added enable_request_statistics field to SharedComponents.
    • Changed ExecutionResult enum variants to use StatsProtoStream instead of ProtoStream.
  • model_gateway/src/routers/grpc/harmony/streaming.rs
    • Updated imports to use StatsProtoStream.
    • Modified process_single_stream, process_dual_stream, process_chat_decode_stream, process_responses_dual_stream, and process_decode_stream to accept StatsProtoStream.
    • Added spawn_stats_emission call for decode_stream in process_chat_decode_stream.
  • model_gateway/src/routers/grpc/pd_router.rs
    • Retrieved enable_request_statistics from ctx.router_config.
    • Passed enable_request_statistics to SharedComponents.
  • model_gateway/src/routers/grpc/pipeline.rs
    • Modified RequestExecutionStage::new calls across various pipelines to include router_backend and metrics_labels::BACKEND_REGULAR or metrics_labels::BACKEND_PD.
  • model_gateway/src/routers/grpc/proto_wrapper.rs
    • Imported UnifiedRequestStats.
    • Added unified_stats_from_sglang function to convert SGLang RequestStats to UnifiedRequestStats.
    • Implemented spawn_stats_emission in ProtoStream for SGLang, which fetches stats and emits an event.
    • Introduced StatsProtoStream struct to wrap ProtoStream and manage stats emission logic, including a spawn_stats_emission method.
    • Added unit tests for unified_stats_from_sglang.
  • model_gateway/src/routers/grpc/regular/streaming.rs
    • Updated imports to use StatsProtoStream.
    • Modified process_streaming_chunks, process_dual_streaming_chunks, process_generate_streaming, process_generate_streaming_dual, and process_generate_streaming_with_input_logprobs to accept StatsProtoStream.
    • Added spawn_stats_emission calls for grpc_stream and stream in various processing functions.
  • model_gateway/src/routers/grpc/router.rs
    • Retrieved enable_request_statistics from ctx.router_config.
    • Passed enable_request_statistics to SharedComponents.
Activity
  • The pull request was created by scottjlee.
  • The pull request description outlines the problem, solution, changes, and test plan for the new observability feature.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a valuable observability feature for logging request-level statistics from the inference engine. The implementation is well-designed, using a feature flag for easy toggling, a unified data structure for engine-agnostic stats, and a non-blocking approach for fetching the data. The code is clean, well-tested, and the changes are consistently applied across the codebase. I have one suggestion to make the code even more idiomatic.

@scottjlee scottjlee marked this pull request as ready for review March 18, 2026 18:28

mergify bot commented Mar 18, 2026

Hi @scottjlee, the DCO sign-off check has failed. All commits must include a Signed-off-by line.

To fix existing commits:

# Sign off the last N commits (replace N with the number of unsigned commits)
git rebase HEAD~N --signoff
git push --force-with-lease

To sign off future commits automatically:

  • Use git commit -s every time, or
  • VSCode: enable Git: Always Sign Off in Settings
  • PyCharm: enable Sign-off commit in the Commit tool window


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5135d1295f

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +404 to +406
self._store_request_stats(
    request_id,
    final_response,


P1: Use the yielded streaming response when persisting stats

final_response is only assigned inside the if not is_stream branch above, so on every stream=true request this call raises UnboundLocalError as soon as the finishing chunk arrives. Because _store_request_stats(...) is unconditional, the regression is not gated by --enable-request-statistics: successful SGLang streams can now terminate with an INTERNAL error at the end of the response instead of closing cleanly.


Comment on lines +150 to +151
if let Some(rate) = other.spec_decoding_acceptance_rate {
    self.spec_decoding_acceptance_rate.get_or_insert(rate);


P2: Aggregate per-sample acceptance rates before logging

ProtoStream::spawn_stats_emission() reduces one RequestStats per sample into a single UnifiedRequestStats, but these lines keep the first sample's spec_decoding_acceptance_rate instead of combining the values. For n>1 requests with speculative decoding enabled, different samples can have very different acceptance rates, so the emitted request_stats event reports an arbitrary sample-0 number while completion_tokens has already been summed across all samples.

Collaborator Author


Wasn't sure what the most appropriate way would be to aggregate spec acceptance rate across multiple samples, so I just kept the first value. Maybe max is most appropriate?


mergify bot commented Mar 18, 2026

Hi @scottjlee, the DCO sign-off check has failed. All commits must include a Signed-off-by line.

To fix existing commits:

# Sign off the last N commits (replace N with the number of unsigned commits)
git rebase HEAD~N --signoff
git push --force-with-lease

To sign off future commits automatically:

  • Use git commit -s every time, or
  • VSCode: enable Git: Always Sign Off in Settings
  • PyCharm: enable Sign-off commit in the Commit tool window


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e5db99ddb4


Comment on lines 1517 to +1519
 pub async fn process_messages_streaming_chunks(
     &self,
-    mut grpc_stream: ProtoStream,
+    mut grpc_stream: StatsProtoStream,


P2: Emit request stats from Messages API streams

After switching this handler to StatsProtoStream, it still only calls mark_completed() before returning; unlike the other streaming handlers, it never calls spawn_stats_emission(). With --enable-request-statistics, every successful streaming /v1/messages request therefore finishes without a request_stats event, so the new observability feature silently does nothing for Anthropic-style streaming traffic.


Comment on lines 526 to +528
 /// Process decode stream for tool call events.
 async fn process_decode_stream(
-    mut decode_stream: ProtoStream,
+    mut decode_stream: StatsProtoStream,


P2: Emit request stats from Harmony Responses streams

This decode path now accepts StatsProtoStream, but after the stream is marked completed it returns without ever invoking spawn_stats_emission(). When --enable-request-statistics is on, successful streaming /v1/responses requests on the Harmony backend—including tool-call iterations—will never log the new request-level stats event.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
model_gateway/src/routers/grpc/regular/streaming.rs (2)

1517-1525: ⚠️ Potential issue | 🟠 Major

Messages streaming never uses the stats wrapper.

This helper now accepts StatsProtoStream, but it returns after grpc_stream.mark_completed() on Line 2094 without ever calling grpc_stream.spawn_stats_emission(). --enable-request-statistics is therefore a no-op for the Messages streaming endpoint, unlike the chat and generate paths in this file.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/grpc/regular/streaming.rs` around lines 1517 -
1525, The process_messages_streaming_chunks function accepts a StatsProtoStream
but never starts stats emission: before calling grpc_stream.mark_completed()
(and before returning) invoke grpc_stream.spawn_stats_emission() so the request
statistics are emitted for the Messages streaming endpoint; ensure
spawn_stats_emission is called (and awaited or left to run as its API requires)
prior to or immediately after mark_completed in
process_messages_streaming_chunks so --enable-request-statistics is not a no-op.

179-185: ⚠️ Potential issue | 🟠 Major

Finalize StatsProtoStream before the trailer SSE writes.

After the read loop is exhausted, phases 3-5 still do several fallible tx.send(...) calls. If the client disconnects there, control returns before Line 547 and Line 563, so a fully-consumed backend stream is still dropped as aborted and never schedules request-stats emission. Please move mark_completed() / spawn_stats_emission() into the post-read path before those trailer writes.

Also applies to: 563-563

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/grpc/regular/streaming.rs` around lines 179 - 185,
The finalization of StatsProtoStream must occur immediately after the read loop
completes and before any fallible tx.send(...) trailer writes: call
grpc_stream.mark_completed() and then context::spawn_stats_emission(grpc_stream,
dispatch, original_request) (or the existing spawn_stats_emission helper you
use) in the post-read path — before performing the subsequent tx.send(...)
trailer SSE writes — so that a fully-consumed backend stream is finalized and
stats emission is scheduled even if a client disconnects during the trailer
sends; update the code paths that currently call
mark_completed()/spawn_stats_emission() after the tx.send(...) calls (including
the branch covering the second trailer) to instead finalize and spawn emission
first, then proceed with tx.send(...) and handle send errors as before.
model_gateway/src/routers/grpc/harmony/streaming.rs (1)

527-533: ⚠️ Potential issue | 🟠 Major

Emit request stats from the Harmony Responses helper.

process_decode_stream() now accepts StatsProtoStream, but it still stops at mark_completed() on Line 1010. Both single- and dual-stream Responses requests funnel through this helper, so --enable-request-statistics never emits request-level stats for Harmony Responses streaming.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/grpc/harmony/streaming.rs` around lines 527 - 533,
process_decode_stream currently takes a StatsProtoStream but never emits
request-level stats because it stops at the existing mark_completed() call (see
mark_completed reference in this function); update the end-of-stream/cleanup
logic in process_decode_stream so that when the incoming StatsProtoStream
contains request statistics you call the request-level stats
emitter/mark_completed path (i.e., invoke the same request-statistics completion
that the dual-stream Responses code uses) before returning, ensuring request
stats are emitted for both single- and dual-stream Harmony Responses; look for
process_decode_stream, StatsProtoStream and mark_completed to add the
conditional completion call and emit the request-level stats.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@grpc_servicer/smg_grpc_servicer/sglang/servicer.py`:
- Around line 562-577: Replace the current error-handling in GetRequestStats to
follow the servicer-wide convention: after logging the exception (using
logger.error with get_exception_traceback()), call await
context.abort(grpc.StatusCode.INTERNAL, str(e)) instead of
context.set_code/context.set_details and returning an empty response; locate the
GetRequestStats method and change its except block accordingly (references:
GetRequestStats, request_manager.get_request_stats, _build_request_stats,
context.abort). Also apply the same fix to GetLoads so both methods use await
context.abort(...) on exceptions.

In `@model_gateway/src/observability/events.rs`:
- Around line 147-152: The current merge uses get_or_insert for
spec_decoding_acceptance_rate, which preserves the first sample's per-sample
rate; instead compute a deterministic completion_tokens-weighted average using
self.completion_tokens and other.completion_tokens and set
self.spec_decoding_acceptance_rate = Some(weighted_rate). If one side is None,
fall back to the existing Some; if both are Some but total_tokens == 0, pick a
deterministic choice (e.g., average the two or keep self). Update the merge
logic in the method that handles merging events (the code around
spec_decoding_acceptance_rate and completion_tokens) to compute and assign the
weighted value rather than calling get_or_insert.

In `@model_gateway/src/routers/grpc/pipeline.rs`:
- Around line 129-132: The pipeline tags RequestExecutionStage with
BACKEND_HARMONY but new_harmony() still sets backend_type to BACKEND_REGULAR;
update new_harmony() to set backend_type to BACKEND_HARMONY so the backend label
is consistent. Locate the new_harmony() constructor (and any Pipeline or Router
builder that initializes backend_type) and replace the BACKEND_REGULAR
assignment with BACKEND_HARMONY so RequestExecutionStage::new(...,
metrics_labels::BACKEND_HARMONY) and backend_type use the same symbol.

---

Outside diff comments:
In `@model_gateway/src/routers/grpc/harmony/streaming.rs`:
- Around line 527-533: process_decode_stream currently takes a StatsProtoStream
but never emits request-level stats because it stops at the existing
mark_completed() call (see mark_completed reference in this function); update
the end-of-stream/cleanup logic in process_decode_stream so that when the
incoming StatsProtoStream contains request statistics you call the request-level
stats emitter/mark_completed path (i.e., invoke the same request-statistics
completion that the dual-stream Responses code uses) before returning, ensuring
request stats are emitted for both single- and dual-stream Harmony Responses;
look for process_decode_stream, StatsProtoStream and mark_completed to add the
conditional completion call and emit the request-level stats.

In `@model_gateway/src/routers/grpc/regular/streaming.rs`:
- Around line 1517-1525: The process_messages_streaming_chunks function accepts
a StatsProtoStream but never starts stats emission: before calling
grpc_stream.mark_completed() (and before returning) invoke
grpc_stream.spawn_stats_emission() so the request statistics are emitted for the
Messages streaming endpoint; ensure spawn_stats_emission is called (and awaited
or left to run as its API requires) prior to or immediately after mark_completed
in process_messages_streaming_chunks so --enable-request-statistics is not a
no-op.
- Around line 179-185: The finalization of StatsProtoStream must occur
immediately after the read loop completes and before any fallible tx.send(...)
trailer writes: call grpc_stream.mark_completed() and then
context::spawn_stats_emission(grpc_stream, dispatch, original_request) (or the
existing spawn_stats_emission helper you use) in the post-read path — before
performing the subsequent tx.send(...) trailer SSE writes — so that a
fully-consumed backend stream is finalized and stats emission is scheduled even
if a client disconnects during the trailer sends; update the code paths that
currently call mark_completed()/spawn_stats_emission() after the tx.send(...)
calls (including the branch covering the second trailer) to instead finalize and
spawn emission first, then proceed with tx.send(...) and handle send errors as
before.

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between ce6f52e and e5db99d.

📒 Files selected for processing (17)
  • crates/grpc_client/proto/sglang_scheduler.proto
  • crates/grpc_client/src/sglang_scheduler.rs
  • grpc_servicer/smg_grpc_servicer/sglang/request_manager.py
  • grpc_servicer/smg_grpc_servicer/sglang/servicer.py
  • model_gateway/src/config/builder.rs
  • model_gateway/src/config/types.rs
  • model_gateway/src/main.rs
  • model_gateway/src/observability/events.rs
  • model_gateway/src/routers/grpc/common/response_collection.rs
  • model_gateway/src/routers/grpc/common/stages/request_execution.rs
  • model_gateway/src/routers/grpc/context.rs
  • model_gateway/src/routers/grpc/harmony/streaming.rs
  • model_gateway/src/routers/grpc/pd_router.rs
  • model_gateway/src/routers/grpc/pipeline.rs
  • model_gateway/src/routers/grpc/proto_wrapper.rs
  • model_gateway/src/routers/grpc/regular/streaming.rs
  • model_gateway/src/routers/grpc/router.rs

Comment on lines +562 to +577
    async def GetRequestStats(
        self,
        request: sglang_scheduler_pb2.GetRequestStatsRequest,
        context: grpc.aio.ServicerContext,
    ) -> sglang_scheduler_pb2.GetRequestStatsResponse:
        """Get request-level statistics for a completed request."""
        try:
            stored = self.request_manager.get_request_stats(request.request_id)
            stats = [self._build_request_stats(entry) for entry in stored]
            return sglang_scheduler_pb2.GetRequestStatsResponse(stats=stats)
        except Exception as e:
            logger.error(f"GetRequestStats failed: {e}\n{get_exception_traceback()}")
            context.set_code(grpc.StatusCode.INTERNAL)
            context.set_details(f"Failed to get request stats: {e}")
            return sglang_scheduler_pb2.GetRequestStatsResponse()


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify INTERNAL error handling convention usage in servicer implementations.
rg -n --type=py 'context\.abort\(grpc\.StatusCode\.INTERNAL,\s*str\(e\)\)' grpc_servicer/smg_grpc_servicer -C2
rg -n --type=py 'context\.set_code\(grpc\.StatusCode\.INTERNAL\)|context\.set_details\(' grpc_servicer/smg_grpc_servicer/sglang/servicer.py -C2

Repository: lightseekorg/smg

Length of output: 2568


Use await context.abort(grpc.StatusCode.INTERNAL, str(e)) instead of set_code/set_details + empty response.

Lines 574–575 diverge from the servicer-wide INTERNAL error handling convention (observed in Embed, Generate, and HealthCheck elsewhere in this file). Use the established pattern of logging server-side and calling context.abort() with the exception string.

🔧 Proposed fix
         except Exception as e:
             logger.error(f"GetRequestStats failed: {e}\n{get_exception_traceback()}")
-            context.set_code(grpc.StatusCode.INTERNAL)
-            context.set_details(f"Failed to get request stats: {e}")
-            return sglang_scheduler_pb2.GetRequestStatsResponse()
+            await context.abort(grpc.StatusCode.INTERNAL, str(e))

Note: GetLoads (lines 548–549) has the same issue and should be fixed similarly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@grpc_servicer/smg_grpc_servicer/sglang/servicer.py` around lines 562 - 577,
Replace the current error-handling in GetRequestStats to follow the
servicer-wide convention: after logging the exception (using logger.error with
get_exception_traceback()), call await context.abort(grpc.StatusCode.INTERNAL,
str(e)) instead of context.set_code/context.set_details and returning an empty
response; locate the GetRequestStats method and change its except block
accordingly (references: GetRequestStats, request_manager.get_request_stats,
_build_request_stats, context.abort). Also apply the same fix to GetLoads so
both methods use await context.abort(...) on exceptions.

Comment on lines +147 to +152
        if let Some(rate) = other.cache_hit_rate {
            self.cache_hit_rate.get_or_insert(rate);
        }
        if let Some(rate) = other.spec_decoding_acceptance_rate {
            self.spec_decoding_acceptance_rate.get_or_insert(rate);
        }

⚠️ Potential issue | 🟠 Major

Don’t keep the first sample’s spec-decoding rate.

For n > 1, spec_decoding_acceptance_rate is per-sample. get_or_insert makes the merged request-level stat depend on whichever sample happens to be reduced first, even though completion_tokens are already being aggregated across samples. Please merge this deterministically instead, e.g. with a completion_tokens-weighted average.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/observability/events.rs` around lines 147 - 152, The
current merge uses get_or_insert for spec_decoding_acceptance_rate, which
preserves the first sample's per-sample rate; instead compute a deterministic
completion_tokens-weighted average using self.completion_tokens and
other.completion_tokens and set self.spec_decoding_acceptance_rate =
Some(weighted_rate). If one side is None, fall back to the existing Some; if
both are Some but total_tokens == 0, pick a deterministic choice (e.g., average
the two or keep self). Update the merge logic in the method that handles merging
events (the code around spec_decoding_acceptance_rate and completion_tokens) to
compute and assign the weighted value rather than calling get_or_insert.
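
For illustration, a deterministic, completion_tokens-weighted merge along these lines could look like the sketch below; the function name, parameters, and fallback behavior are assumptions, not the PR's actual fix.

// Sketch: weight each sample's acceptance rate by its completion tokens.
fn merge_acceptance_rate(
    self_rate: Option<f64>,
    self_tokens: u64,
    other_rate: Option<f64>,
    other_tokens: u64,
) -> Option<f64> {
    match (self_rate, other_rate) {
        (Some(a), Some(b)) => {
            let total = self_tokens + other_tokens;
            if total == 0 {
                // Deterministic fallback when there is nothing to weight by.
                Some((a + b) / 2.0)
            } else {
                Some((a * self_tokens as f64 + b * other_tokens as f64) / total as f64)
            }
        }
        (Some(a), None) => Some(a),
        (None, Some(b)) => Some(b),
        (None, None) => None,
    }
}

UnifiedRequestStats::merge could then assign the weighted value instead of calling get_or_insert.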


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e7d3dbd41a


Comment on lines +307 to +311
self._handle_single_request(
    gen_obj,
    gen_request_id,
    grpc_context,
    parent_request_id=base_request_id,


P2: Include the prefix-caching phase in n>1 request stats

For any SGLang request with n > 1, generate_request() still performs the separate base_request_id-prefix prefill round-trip just above, but only the phase-2 children are copied onto the parent request here via parent_request_id=base_request_id. That means GetRequestStats(base_request_id) omits the entire prefix-caching phase, so the new request_stats event under-reports latency and prompt-side work for every multi-sample request.


Comment on lines +915 to +918
if time_stats.response_sent_to_client_time > 0.0:
    meta_info.setdefault(
        "response_sent_timestamp_s",
        time_stats.get_response_sent_to_client_realtime(),


P2: Record response_sent timestamp before exporting stats

_populate_timestamps() now advertises response_sent_timestamp_s, but this gRPC request manager never records response_sent_to_client_time before _store_request_stats() persists the final meta. As a result, every emitted request_stats event will leave response_sent_timestamp_s unset, so any dashboard or experiment that needs server-to-client send latency cannot use the new field.


Comment on lines +855 to +857
state.time_stats.set_finished_time()

meta: dict[str, Any] = state.time_stats.convert_to_output_meta_info(


P2: Preserve the scheduler finish time when building meta

_handle_batch_output() already stamps finished_time as soon as the scheduler marks the request complete, but this new call overwrites it later when the final item is dequeued for gRPC. On slow consumers or a backed-up out_queue, request_finished_timestamp_s and the derived latency fields now include post-engine delivery delay instead of actual engine completion time, which skews the request_stats event.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@grpc_servicer/smg_grpc_servicer/sglang/request_manager.py`:
- Around line 876-880: The current checks assume batch_out.spec_verify_ct and
batch_out.spec_accepted_tokens are non-None and index-aligned, which can raise
when they are None or shorter; update the conditional around the completion-path
stats emission to first guard each array for truthiness/None and length (e.g.,
ensure batch_out.spec_verify_ct is not None and index <
len(batch_out.spec_verify_ct) and batch_out.spec_verify_ct[index] > 0, and
similarly ensure batch_out.spec_accepted_tokens is not None and index <
len(batch_out.spec_accepted_tokens)) before any len() or index access; apply the
same defensive checks to the later block that uses these arrays (the code around
lines referenced 883-885) so completion-path stats emission won’t throw when
arrays are missing or misaligned.
- Around line 938-944: The eviction currently uses insertion-order on
self.completed_request_stats so parent aggregates can be evicted even after
updates; when you write a parent aggregate (in the block that checks
parent_request_id and sets parent_entries using parent_index and entry), refresh
that key's recency before eviction by "touching" it: either remove and re-insert
the parent_request_id key or call OrderedDict.move_to_end(parent_request_id) if
completed_request_stats is an OrderedDict, then assign
parent_entries[parent_index] = {**entry, "index": parent_index}; keep the
existing while loop that trims entries to self._MAX_COMPLETED_STATS_CACHE so the
FIFO eviction still runs on truly stale keys. This ensures updates to parent
aggregates prevent premature eviction.

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between e7d3dbd and 2e29ece.

📒 Files selected for processing (1)
  • grpc_servicer/smg_grpc_servicer/sglang/request_manager.py

Comment on lines +876 to +880
        if not (
            len(batch_out.spec_verify_ct) > index
            and batch_out.spec_verify_ct[index] > 0
            and len(batch_out.spec_accepted_tokens) > index
        ):

⚠️ Potential issue | 🟠 Major

Guard speculative arrays before len/index access.

Line 877 assumes batch_out.spec_verify_ct and batch_out.spec_accepted_tokens are always present and index-aligned. If either is None or shorter than index, this can raise and skip completion-path stats emission.

Proposed defensive fix
     def _populate_spec_decoding_metrics(
         self,
         meta_info: dict[str, Any],
         batch_out: BatchTokenIDOutput,
         index: int,
     ) -> None:
         """Populate speculative decoding stats when available."""
-        if not (
-            len(batch_out.spec_verify_ct) > index
-            and batch_out.spec_verify_ct[index] > 0
-            and len(batch_out.spec_accepted_tokens) > index
-        ):
+        spec_verify_ct = batch_out.spec_verify_ct or []
+        spec_accepted_tokens = batch_out.spec_accepted_tokens or []
+        if not (
+            index < len(spec_verify_ct)
+            and spec_verify_ct[index] > 0
+            and index < len(spec_accepted_tokens)
+        ):
             return
 
-        verify_ct = batch_out.spec_verify_ct[index]
-        accepted_tokens = batch_out.spec_accepted_tokens[index]
+        verify_ct = spec_verify_ct[index]
+        accepted_tokens = spec_accepted_tokens[index]

Also applies to: 883-885

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@grpc_servicer/smg_grpc_servicer/sglang/request_manager.py` around lines 876 -
880, The current checks assume batch_out.spec_verify_ct and
batch_out.spec_accepted_tokens are non-None and index-aligned, which can raise
when they are None or shorter; update the conditional around the completion-path
stats emission to first guard each array for truthiness/None and length (e.g.,
ensure batch_out.spec_verify_ct is not None and index <
len(batch_out.spec_verify_ct) and batch_out.spec_verify_ct[index] > 0, and
similarly ensure batch_out.spec_accepted_tokens is not None and index <
len(batch_out.spec_accepted_tokens)) before any len() or index access; apply the
same defensive checks to the later block that uses these arrays (the code around
lines referenced 883-885) so completion-path stats emission won’t throw when
arrays are missing or misaligned.

Comment on lines +938 to +944
        if parent_request_id is not None:
            parent_entries = self.completed_request_stats.setdefault(parent_request_id, {})
            parent_entries[parent_index] = {**entry, "index": parent_index}

        while len(self.completed_request_stats) > self._MAX_COMPLETED_STATS_CACHE:
            oldest = next(iter(self.completed_request_stats))
            del self.completed_request_stats[oldest]

⚠️ Potential issue | 🟠 Major

Eviction policy can drop in-progress parent aggregates.

Line 942 evicts by oldest insertion order, but updates do not refresh key recency. Under high throughput, a parent aggregate key can be evicted while later child samples are still arriving, returning partial stats for the parent request.

Proposed fix to “touch” keys on write before FIFO eviction
-        entries = self.completed_request_stats.setdefault(request_id, {})
+        entries = self.completed_request_stats.pop(request_id, {})
         entries[index] = entry
+        self.completed_request_stats[request_id] = entries
 
-        if parent_request_id is not None:
-            parent_entries = self.completed_request_stats.setdefault(parent_request_id, {})
+        if parent_request_id is not None and parent_request_id != request_id:
+            parent_entries = self.completed_request_stats.pop(parent_request_id, {})
             parent_entries[parent_index] = {**entry, "index": parent_index}
+            self.completed_request_stats[parent_request_id] = parent_entries
 
         while len(self.completed_request_stats) > self._MAX_COMPLETED_STATS_CACHE:
             oldest = next(iter(self.completed_request_stats))
             del self.completed_request_stats[oldest]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@grpc_servicer/smg_grpc_servicer/sglang/request_manager.py` around lines 938 -
944, The eviction currently uses insertion-order on self.completed_request_stats
so parent aggregates can be evicted even after updates; when you write a parent
aggregate (in the block that checks parent_request_id and sets parent_entries
using parent_index and entry), refresh that key's recency before eviction by
"touching" it: either remove and re-insert the parent_request_id key or call
OrderedDict.move_to_end(parent_request_id) if completed_request_stats is an
OrderedDict, then assign parent_entries[parent_index] = {**entry, "index":
parent_index}; keep the existing while loop that trims entries to
self._MAX_COMPLETED_STATS_CACHE so the FIFO eviction still runs on truly stale
keys. This ensures updates to parent aggregates prevent premature eviction.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2e29ece8e4


// Mark both streams as completed now that both succeeded
prefill.mark_completed();
decode_stream.mark_completed();
decode_stream.spawn_stats_emission();


P2: Fetch request stats from both PD workers

Checked the SGLang PD path in request_execution.rs::execute_dual_dispatch(): it launches separate prefill and decode RPCs, but this block only calls spawn_stats_emission() on decode_stream. For any non-streaming disaggregated request, the emitted request_stats event will therefore omit the prefill worker's latency/cache work and systematically under-report prompt-side cost whenever PD mode is enabled.
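
For illustration, if the prefill half is also wrapped in a StatsProtoStream, addressing this would roughly amount to scheduling emission on both streams (a sketch of the reviewer's suggestion, not the PR's current code):

// Sketch: schedule stats emission from both PD workers once both streams succeed.
prefill.mark_completed();
prefill.spawn_stats_emission();
decode_stream.mark_completed();
decode_stream.spawn_stats_emission();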


Comment on lines +628 to +629
if is_finished:
    meta_info.update(self._build_finished_meta(state, batch_out, i))


P2: Stamp finished_time before building request stats

_build_finished_meta() reads state.time_stats.finished_time via _populate_timestamps(), but here it runs before the completion branch later calls state.time_stats.set_finished_time(). On normal completions that leaves request_finished_timestamp_s unset in every stored RequestStats, so the new observability event cannot report when the engine actually finished the request.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@grpc_servicer/smg_grpc_servicer/sglang/request_manager.py`:
- Around line 615-631: The code indexes batch_out.finished_reasons without a
None check while other per-item arrays are guarded, risking a TypeError; update
the access in the is_finished assignment and the meta_info "finish_reason"
population to mirror the defensive pattern used for
prompt_tokens/completion_tokens/cached_tokens: first test if
batch_out.finished_reasons is truthy (or not None) and only then index [i],
otherwise default is_finished to False and finish_reason to None, and keep
updating state.last_* accordingly (refer to batch_out, finished_reasons,
is_finished, meta_info, and
state.last_prompt_tokens/last_completion_tokens/last_cached_tokens).

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between 2e29ece and 37b196e.

📒 Files selected for processing (1)
  • grpc_servicer/smg_grpc_servicer/sglang/request_manager.py

Comment on lines +615 to +631
            prompt_tokens = batch_out.prompt_tokens[i] if batch_out.prompt_tokens else 0
            completion_tokens = batch_out.completion_tokens[i] if batch_out.completion_tokens else 0
            cached_tokens = batch_out.cached_tokens[i] if batch_out.cached_tokens else 0
            is_finished = batch_out.finished_reasons[i] is not None

            state.last_prompt_tokens = prompt_tokens
            state.last_completion_tokens = completion_tokens
            state.last_cached_tokens = cached_tokens

            meta_info: dict[str, Any] = {
                "prompt_tokens": prompt_tokens,
                "completion_tokens": completion_tokens,
                "cached_tokens": cached_tokens,
                "finish_reason": (
                    batch_out.finished_reasons[i] if batch_out.finished_reasons[i] else None
                ),
            }

⚠️ Potential issue | 🟡 Minor

Inconsistent defensive access for finished_reasons.

Lines 615-617 defensively check prompt_tokens, completion_tokens, and cached_tokens for None before indexing, but finished_reasons is accessed directly at lines 618 and 628-630 without a similar guard. If finished_reasons could ever be None (as the pattern suggests for other arrays), this would raise TypeError.

🛡️ Proposed fix for consistency
-            is_finished = batch_out.finished_reasons[i] is not None
+            is_finished = (
+                batch_out.finished_reasons is not None
+                and batch_out.finished_reasons[i] is not None
+            )
 
             state.last_prompt_tokens = prompt_tokens
             state.last_completion_tokens = completion_tokens
             state.last_cached_tokens = cached_tokens
 
             meta_info: dict[str, Any] = {
                 "prompt_tokens": prompt_tokens,
                 "completion_tokens": completion_tokens,
                 "cached_tokens": cached_tokens,
                 "finish_reason": (
-                    batch_out.finished_reasons[i] if batch_out.finished_reasons[i] else None
+                    batch_out.finished_reasons[i]
+                    if batch_out.finished_reasons and batch_out.finished_reasons[i]
+                    else None
                 ),
             }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@grpc_servicer/smg_grpc_servicer/sglang/request_manager.py` around lines 615 -
631, The code indexes batch_out.finished_reasons without a None check while
other per-item arrays are guarded, risking a TypeError; update the access in the
is_finished assignment and the meta_info "finish_reason" population to mirror
the defensive pattern used for prompt_tokens/completion_tokens/cached_tokens:
first test if batch_out.finished_reasons is truthy (or not None) and only then
index [i], otherwise default is_finished to False and finish_reason to None, and
keep updating state.last_* accordingly (refer to batch_out, finished_reasons,
is_finished, meta_info, and
state.last_prompt_tokens/last_completion_tokens/last_cached_tokens).


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 37b196ee91


Comment on lines +396 to +397
if state.time_stats.response_sent_to_client_time == 0.0:
    state.time_stats.set_response_sent_to_client_time()


P2: Stamp streaming response_sent_timestamp_s after the final chunk

For any SGLang streaming request with more than one chunk, this records response_sent_to_client_time on the first chunk and never updates it because of the == 0.0 guard. The new response_sent_timestamp_s field therefore collapses to a TTFT-like timestamp instead of “response finished streaming”, so dashboards derived from these stats will systematically under-report end-to-end send latency on successful streams.


Comment on lines +572 to +573
mut prefill_stream: StatsProtoStream,
decode_stream: StatsProtoStream,


P2: Emit request stats for the prefill half of PD streams

Checked the regular PD streaming helpers in model_gateway/src/routers/grpc/regular/streaming.rs: after the decode side succeeds they only call prefill_stream.mark_completed(), but never prefill_stream.spawn_stats_emission(). With --enable-request-statistics, disaggregated streaming chat requests therefore log only decode-worker stats, so prompt-side latency/cache work from the prefill worker disappears; the same pattern is repeated in process_generate_streaming_dual() for streaming /generate traffic.


Labels

grpc (gRPC client and router changes), model-gateway (Model gateway crate changes)
