feat(observability): add logging request level statistics from engine (#757)
Conversation
Signed-off-by: Scott Lee <scott@together.ai>
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a significant observability enhancement by enabling the collection and logging of detailed performance statistics for individual inference requests. It establishes an engine-agnostic framework, with initial support for SGLang, to capture metrics such as request timings, token counts, and cache hit rates. This data is useful for performance analysis, debugging, and production monitoring. The feature is opt-in, controlled by a new command-line flag, so it only affects users who explicitly enable it.
Code Review
This pull request introduces a valuable observability feature for logging request-level statistics from the inference engine. The implementation is well-designed, using a feature flag for easy toggling, a unified data structure for engine-agnostic stats, and a non-blocking approach for fetching the data. The code is clean, well-tested, and the changes are consistently applied across the codebase. I have one suggestion to make the code even more idiomatic.
Hi @scottjlee, the DCO sign-off check has failed. All commits must include a `Signed-off-by` line. To fix existing commits:

```shell
# Sign off the last N commits (replace N with the number of unsigned commits)
git rebase HEAD~N --signoff
git push --force-with-lease
```

To sign off future commits automatically:
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5135d1295f
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
```python
self._store_request_stats(
    request_id,
    final_response,
```
Use the yielded streaming response when persisting stats
final_response is only assigned inside the if not is_stream branch above, so on every stream=true request this call raises UnboundLocalError as soon as the finishing chunk arrives. Because _store_request_stats(...) is unconditional, the regression is not gated by --enable-request-statistics: successful SGLang streams can now terminate with an INTERNAL error at the end of the response instead of closing cleanly.
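The failure mode is reproducible in isolation; a minimal sketch of the conditional-assignment bug (the names `is_stream` and `final_response` mirror the review comment, everything else is illustrative):

```python
def finish_request(is_stream: bool) -> dict:
    if not is_stream:
        # Only the non-streaming branch binds the name.
        final_response = {"status": "ok"}
    # Unconditional use: raises UnboundLocalError on the streaming path.
    return final_response

print(finish_request(False))  # {'status': 'ok'}
try:
    finish_request(True)
except UnboundLocalError:
    print("streaming path crashed at the finishing chunk")
```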
```rust
if let Some(rate) = other.spec_decoding_acceptance_rate {
    self.spec_decoding_acceptance_rate.get_or_insert(rate);
}
```
Aggregate per-sample acceptance rates before logging
ProtoStream::spawn_stats_emission() reduces one RequestStats per sample into a single UnifiedRequestStats, but these lines keep the first sample's spec_decoding_acceptance_rate instead of combining the values. For n>1 requests with speculative decoding enabled, different samples can have very different acceptance rates, so the emitted request_stats event reports an arbitrary sample-0 number while completion_tokens has already been summed across all samples.
Wasn't sure what the most appropriate way would be to aggregate the spec acceptance rate across multiple samples, so I just kept the first value. Maybe max is most appropriate?
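One deterministic alternative is a completion_tokens-weighted average, which keeps the merged rate independent of sample order. This sketch is illustrative only; the function name and fallback choices are hypothetical, and the real merge lives in Rust in `model_gateway/src/observability/events.rs`:

```python
def merge_acceptance_rate(self_rate, self_tokens, other_rate, other_tokens):
    """Combine per-sample spec-decoding acceptance rates, weighted by completion tokens."""
    if self_rate is None:
        return other_rate
    if other_rate is None:
        return self_rate
    total = self_tokens + other_tokens
    if total == 0:
        # Deterministic fallback when neither sample produced tokens.
        return (self_rate + other_rate) / 2
    return (self_rate * self_tokens + other_rate * other_tokens) / total

# Two samples with very different acceptance rates:
print(merge_acceptance_rate(0.5, 100, 0.25, 100))  # 0.375
```

Unlike max or first-wins, the weighted average stays consistent with `completion_tokens` already being summed across samples.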
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e5db99ddb4
```diff
 pub async fn process_messages_streaming_chunks(
     &self,
-    mut grpc_stream: ProtoStream,
+    mut grpc_stream: StatsProtoStream,
```
Emit request stats from Messages API streams
After switching this handler to StatsProtoStream, it still only calls mark_completed() before returning; unlike the other streaming handlers, it never calls spawn_stats_emission(). With --enable-request-statistics, every successful streaming /v1/messages request therefore finishes without a request_stats event, so the new observability feature silently does nothing for Anthropic-style streaming traffic.
```diff
 /// Process decode stream for tool call events.
 async fn process_decode_stream(
-    mut decode_stream: ProtoStream,
+    mut decode_stream: StatsProtoStream,
```
Emit request stats from Harmony Responses streams
This decode path now accepts StatsProtoStream, but after the stream is marked completed it returns without ever invoking spawn_stats_emission(). When --enable-request-statistics is on, successful streaming /v1/responses requests on the Harmony backend—including tool-call iterations—will never log the new request-level stats event.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
model_gateway/src/routers/grpc/regular/streaming.rs (2)
1517-1525: ⚠️ Potential issue | 🟠 Major: Messages streaming never uses the stats wrapper.
This helper now accepts `StatsProtoStream`, but it returns after `grpc_stream.mark_completed()` on line 2094 without ever calling `grpc_stream.spawn_stats_emission()`. `--enable-request-statistics` is therefore a no-op for the Messages streaming endpoint, unlike the chat and generate paths in this file.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@model_gateway/src/routers/grpc/regular/streaming.rs` around lines 1517 - 1525, The process_messages_streaming_chunks function accepts a StatsProtoStream but never starts stats emission: before calling grpc_stream.mark_completed() (and before returning) invoke grpc_stream.spawn_stats_emission() so the request statistics are emitted for the Messages streaming endpoint; ensure spawn_stats_emission is called (and awaited or left to run as its API requires) prior to or immediately after mark_completed in process_messages_streaming_chunks so --enable-request-statistics is not a no-op.
179-185: ⚠️ Potential issue | 🟠 Major: Finalize `StatsProtoStream` before the trailer SSE writes.
After the read loop is exhausted, phases 3-5 still do several fallible `tx.send(...)` calls. If the client disconnects there, control returns before line 547 and line 563, so a fully-consumed backend stream is still dropped as aborted and never schedules request-stats emission. Please move `mark_completed()`/`spawn_stats_emission()` into the post-read path before those trailer writes.
Also applies to: 563-563
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@model_gateway/src/routers/grpc/regular/streaming.rs` around lines 179 - 185, The finalization of StatsProtoStream must occur immediately after the read loop completes and before any fallible tx.send(...) trailer writes: call grpc_stream.mark_completed() and then context::spawn_stats_emission(grpc_stream, dispatch, original_request) (or the existing spawn_stats_emission helper you use) in the post-read path, before performing the subsequent tx.send(...) trailer SSE writes, so that a fully-consumed backend stream is finalized and stats emission is scheduled even if a client disconnects during the trailer sends; update the code paths that currently call mark_completed()/spawn_stats_emission() after the tx.send(...) calls (including the branch covering the second trailer) to instead finalize and spawn emission first, then proceed with tx.send(...) and handle send errors as before.

model_gateway/src/routers/grpc/harmony/streaming.rs (1)
527-533: ⚠️ Potential issue | 🟠 Major: Emit request stats from the Harmony Responses helper.
`process_decode_stream()` now accepts `StatsProtoStream`, but it still stops at `mark_completed()` on line 1010. Both single- and dual-stream Responses requests funnel through this helper, so `--enable-request-statistics` never emits request-level stats for Harmony Responses streaming.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@model_gateway/src/routers/grpc/harmony/streaming.rs` around lines 527 - 533, process_decode_stream currently takes a StatsProtoStream but never emits request-level stats because it stops at the existing mark_completed() call (see mark_completed reference in this function); update the end-of-stream/cleanup logic in process_decode_stream so that when the incoming StatsProtoStream contains request statistics you call the request-level stats emitter/mark_completed path (i.e., invoke the same request-statistics completion that the dual-stream Responses code uses) before returning, ensuring request stats are emitted for both single- and dual-stream Harmony Responses; look for process_decode_stream, StatsProtoStream and mark_completed to add the conditional completion call and emit the request-level stats.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@model_gateway/src/routers/grpc/pipeline.rs`:
- Around line 129-132: The pipeline tags RequestExecutionStage with
BACKEND_HARMONY but new_harmony() still sets backend_type to BACKEND_REGULAR;
update new_harmony() to set backend_type to BACKEND_HARMONY so the backend label
is consistent. Locate the new_harmony() constructor (and any Pipeline or Router
builder that initializes backend_type) and replace the BACKEND_REGULAR
assignment with BACKEND_HARMONY so RequestExecutionStage::new(...,
metrics_labels::BACKEND_HARMONY) and backend_type use the same symbol.
📒 Files selected for processing (17)
- crates/grpc_client/proto/sglang_scheduler.proto
- crates/grpc_client/src/sglang_scheduler.rs
- grpc_servicer/smg_grpc_servicer/sglang/request_manager.py
- grpc_servicer/smg_grpc_servicer/sglang/servicer.py
- model_gateway/src/config/builder.rs
- model_gateway/src/config/types.rs
- model_gateway/src/main.rs
- model_gateway/src/observability/events.rs
- model_gateway/src/routers/grpc/common/response_collection.rs
- model_gateway/src/routers/grpc/common/stages/request_execution.rs
- model_gateway/src/routers/grpc/context.rs
- model_gateway/src/routers/grpc/harmony/streaming.rs
- model_gateway/src/routers/grpc/pd_router.rs
- model_gateway/src/routers/grpc/pipeline.rs
- model_gateway/src/routers/grpc/proto_wrapper.rs
- model_gateway/src/routers/grpc/regular/streaming.rs
- model_gateway/src/routers/grpc/router.rs
```python
async def GetRequestStats(
    self,
    request: sglang_scheduler_pb2.GetRequestStatsRequest,
    context: grpc.aio.ServicerContext,
) -> sglang_scheduler_pb2.GetRequestStatsResponse:
    """Get request-level statistics for a completed request."""
    try:
        stored = self.request_manager.get_request_stats(request.request_id)
        stats = [self._build_request_stats(entry) for entry in stored]
        return sglang_scheduler_pb2.GetRequestStatsResponse(stats=stats)
    except Exception as e:
        logger.error(f"GetRequestStats failed: {e}\n{get_exception_traceback()}")
        context.set_code(grpc.StatusCode.INTERNAL)
        context.set_details(f"Failed to get request stats: {e}")
        return sglang_scheduler_pb2.GetRequestStatsResponse()
```
🧩 Analysis chain
🏁 Script executed:
```shell
#!/bin/bash
# Verify INTERNAL error handling convention usage in servicer implementations.
rg -n --type=py 'context\.abort\(grpc\.StatusCode\.INTERNAL,\s*str\(e\)\)' grpc_servicer/smg_grpc_servicer -C2
rg -n --type=py 'context\.set_code\(grpc\.StatusCode\.INTERNAL\)|context\.set_details\(' grpc_servicer/smg_grpc_servicer/sglang/servicer.py -C2
```

Repository: lightseekorg/smg
Length of output: 2568
Use await context.abort(grpc.StatusCode.INTERNAL, str(e)) instead of set_code/set_details + empty response.
Lines 574–575 diverge from the servicer-wide INTERNAL error handling convention (observed in Embed, Generate, and HealthCheck elsewhere in this file). Use the established pattern of logging server-side and calling context.abort() with the exception string.
🔧 Proposed fix

```diff
 except Exception as e:
     logger.error(f"GetRequestStats failed: {e}\n{get_exception_traceback()}")
-    context.set_code(grpc.StatusCode.INTERNAL)
-    context.set_details(f"Failed to get request stats: {e}")
-    return sglang_scheduler_pb2.GetRequestStatsResponse()
+    await context.abort(grpc.StatusCode.INTERNAL, str(e))
```

Note: GetLoads (lines 548–549) has the same issue and should be fixed similarly.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@grpc_servicer/smg_grpc_servicer/sglang/servicer.py` around lines 562 - 577,
Replace the current error-handling in GetRequestStats to follow the
servicer-wide convention: after logging the exception (using logger.error with
get_exception_traceback()), call await context.abort(grpc.StatusCode.INTERNAL,
str(e)) instead of context.set_code/context.set_details and returning an empty
response; locate the GetRequestStats method and change its except block
accordingly (references: GetRequestStats, request_manager.get_request_stats,
_build_request_stats, context.abort). Also apply the same fix to GetLoads so
both methods use await context.abort(...) on exceptions.
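The behavioral difference matters: `context.abort()` raises, so the handler never returns a misleading empty payload, while `set_code`/`set_details` still hands the client an OK-shaped empty message. A dependency-free sketch of the two shapes, using a stub context instead of real grpc (all names here are illustrative, not the grpc API):

```python
import asyncio

class RpcAborted(Exception):
    """Stands in for the exception grpc raises internally on context.abort()."""

class StubContext:
    def __init__(self):
        self.code = None
        self.details = None

    def set_code(self, code):
        self.code = code

    def set_details(self, details):
        self.details = details

    async def abort(self, code, details):
        self.code, self.details = code, details
        raise RpcAborted(details)  # control never returns to the handler

async def handler_with_abort(context):
    try:
        raise RuntimeError("stats lookup failed")
    except Exception as e:
        await context.abort("INTERNAL", str(e))
    return {}  # unreachable: abort raised

async def handler_with_set_code(context):
    try:
        raise RuntimeError("stats lookup failed")
    except Exception as e:
        context.set_code("INTERNAL")
        context.set_details(str(e))
        return {}  # client still receives an empty, OK-shaped message body

# Demo: abort surfaces as a raised exception, never a bogus empty reply.
ctx = StubContext()
try:
    asyncio.run(handler_with_abort(ctx))
except RpcAborted as e:
    print("aborted:", e)  # aborted: stats lookup failed
```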
```rust
if let Some(rate) = other.cache_hit_rate {
    self.cache_hit_rate.get_or_insert(rate);
}
if let Some(rate) = other.spec_decoding_acceptance_rate {
    self.spec_decoding_acceptance_rate.get_or_insert(rate);
}
```
Don’t keep the first sample’s spec-decoding rate.
For n > 1, spec_decoding_acceptance_rate is per-sample. get_or_insert makes the merged request-level stat depend on whichever sample happens to be reduced first, even though completion_tokens are already being aggregated across samples. Please merge this deterministically instead, e.g. with a completion_tokens-weighted average.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@model_gateway/src/observability/events.rs` around lines 147 - 152, The
current merge uses get_or_insert for spec_decoding_acceptance_rate, which
preserves the first sample's per-sample rate; instead compute a deterministic
completion_tokens-weighted average using self.completion_tokens and
other.completion_tokens and set self.spec_decoding_acceptance_rate =
Some(weighted_rate). If one side is None, fall back to the existing Some; if
both are Some but total_tokens == 0, pick a deterministic choice (e.g., average
the two or keep self). Update the merge logic in the method that handles merging
events (the code around spec_decoding_acceptance_rate and completion_tokens) to
compute and assign the weighted value rather than calling get_or_insert.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e7d3dbd41a
```python
self._handle_single_request(
    gen_obj,
    gen_request_id,
    grpc_context,
    parent_request_id=base_request_id,
```
Include the prefix-caching phase in n>1 request stats
For any SGLang request with n > 1, generate_request() still performs the separate base_request_id-prefix prefill round-trip just above, but only the phase-2 children are copied onto the parent request here via parent_request_id=base_request_id. That means GetRequestStats(base_request_id) omits the entire prefix-caching phase, so the new request_stats event under-reports latency and prompt-side work for every multi-sample request.
```python
if time_stats.response_sent_to_client_time > 0.0:
    meta_info.setdefault(
        "response_sent_timestamp_s",
        time_stats.get_response_sent_to_client_realtime(),
```
Record response_sent timestamp before exporting stats
_populate_timestamps() now advertises response_sent_timestamp_s, but this gRPC request manager never records response_sent_to_client_time before _store_request_stats() persists the final meta. As a result, every emitted request_stats event will leave response_sent_timestamp_s unset, so any dashboard or experiment that needs server-to-client send latency cannot use the new field.
```python
state.time_stats.set_finished_time()

meta: dict[str, Any] = state.time_stats.convert_to_output_meta_info(
```
Preserve the scheduler finish time when building meta
_handle_batch_output() already stamps finished_time as soon as the scheduler marks the request complete, but this new call overwrites it later when the final item is dequeued for gRPC. On slow consumers or a backed-up out_queue, request_finished_timestamp_s and the derived latency fields now include post-engine delivery delay instead of actual engine completion time, which skews the request_stats event.
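One way to avoid the overwrite (a sketch, not the project's actual `TimeStats` API) is to make the finish stamp idempotent, so the first, scheduler-side call wins:

```python
import time

class TimeStats:
    """Minimal illustration: record the engine finish time at most once."""

    def __init__(self):
        self.finished_time = 0.0

    def set_finished_time(self):
        # Only the first call (when the scheduler marks the request complete)
        # takes effect; a later call from the gRPC dequeue path on a slow
        # consumer no longer overwrites it.
        if self.finished_time == 0.0:
            self.finished_time = time.monotonic()

stats = TimeStats()
stats.set_finished_time()
first = stats.finished_time
stats.set_finished_time()  # simulated late call when the final item is dequeued
assert stats.finished_time == first
```

With this guard, `request_finished_timestamp_s` reflects engine completion rather than delivery delay.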
Actionable comments posted: 2
📒 Files selected for processing (1)
grpc_servicer/smg_grpc_servicer/sglang/request_manager.py
```python
if not (
    len(batch_out.spec_verify_ct) > index
    and batch_out.spec_verify_ct[index] > 0
    and len(batch_out.spec_accepted_tokens) > index
):
```
Guard speculative arrays before len/index access.
Line 877 assumes batch_out.spec_verify_ct and batch_out.spec_accepted_tokens are always present and index-aligned. If either is None or shorter than index, this can raise and skip completion-path stats emission.
Proposed defensive fix

```diff
 def _populate_spec_decoding_metrics(
     self,
     meta_info: dict[str, Any],
     batch_out: BatchTokenIDOutput,
     index: int,
 ) -> None:
     """Populate speculative decoding stats when available."""
-    if not (
-        len(batch_out.spec_verify_ct) > index
-        and batch_out.spec_verify_ct[index] > 0
-        and len(batch_out.spec_accepted_tokens) > index
-    ):
+    spec_verify_ct = batch_out.spec_verify_ct or []
+    spec_accepted_tokens = batch_out.spec_accepted_tokens or []
+    if not (
+        index < len(spec_verify_ct)
+        and spec_verify_ct[index] > 0
+        and index < len(spec_accepted_tokens)
+    ):
         return
-    verify_ct = batch_out.spec_verify_ct[index]
-    accepted_tokens = batch_out.spec_accepted_tokens[index]
+    verify_ct = spec_verify_ct[index]
+    accepted_tokens = spec_accepted_tokens[index]
```

Also applies to: 883-885
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@grpc_servicer/smg_grpc_servicer/sglang/request_manager.py` around lines 876 -
880, The current checks assume batch_out.spec_verify_ct and
batch_out.spec_accepted_tokens are non-None and index-aligned, which can raise
when they are None or shorter; update the conditional around the completion-path
stats emission to first guard each array for truthiness/None and length (e.g.,
ensure batch_out.spec_verify_ct is not None and index <
len(batch_out.spec_verify_ct) and batch_out.spec_verify_ct[index] > 0, and
similarly ensure batch_out.spec_accepted_tokens is not None and index <
len(batch_out.spec_accepted_tokens)) before any len() or index access; apply the
same defensive checks to the later block that uses these arrays (the code around
lines referenced 883-885) so completion-path stats emission won’t throw when
arrays are missing or misaligned.
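The `or []` guard can be exercised in isolation; a small sketch of just the availability check (the function name is illustrative, not the actual `BatchTokenIDOutput` API):

```python
def spec_metrics_available(spec_verify_ct, spec_accepted_tokens, index):
    """True only when both arrays are present, long enough, and non-zero at index."""
    spec_verify_ct = spec_verify_ct or []
    spec_accepted_tokens = spec_accepted_tokens or []
    return (
        index < len(spec_verify_ct)
        and spec_verify_ct[index] > 0
        and index < len(spec_accepted_tokens)
    )

# None, empty, and misaligned arrays are all handled without raising:
print(spec_metrics_available(None, None, 0))    # False
print(spec_metrics_available([3], [12], 0))     # True
print(spec_metrics_available([3, 2], [12], 1))  # False (accepted array too short)
```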
```python
if parent_request_id is not None:
    parent_entries = self.completed_request_stats.setdefault(parent_request_id, {})
    parent_entries[parent_index] = {**entry, "index": parent_index}

while len(self.completed_request_stats) > self._MAX_COMPLETED_STATS_CACHE:
    oldest = next(iter(self.completed_request_stats))
    del self.completed_request_stats[oldest]
```
Eviction policy can drop in-progress parent aggregates.
Line 942 evicts by oldest insertion order, but updates do not refresh key recency. Under high throughput, a parent aggregate key can be evicted while later child samples are still arriving, returning partial stats for the parent request.
Proposed fix to “touch” keys on write before FIFO eviction

```diff
-entries = self.completed_request_stats.setdefault(request_id, {})
+entries = self.completed_request_stats.pop(request_id, {})
 entries[index] = entry
+self.completed_request_stats[request_id] = entries
-if parent_request_id is not None:
-    parent_entries = self.completed_request_stats.setdefault(parent_request_id, {})
+if parent_request_id is not None and parent_request_id != request_id:
+    parent_entries = self.completed_request_stats.pop(parent_request_id, {})
     parent_entries[parent_index] = {**entry, "index": parent_index}
+    self.completed_request_stats[parent_request_id] = parent_entries
 while len(self.completed_request_stats) > self._MAX_COMPLETED_STATS_CACHE:
     oldest = next(iter(self.completed_request_stats))
     del self.completed_request_stats[oldest]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@grpc_servicer/smg_grpc_servicer/sglang/request_manager.py` around lines 938 -
944, The eviction currently uses insertion-order on self.completed_request_stats
so parent aggregates can be evicted even after updates; when you write a parent
aggregate (in the block that checks parent_request_id and sets parent_entries
using parent_index and entry), refresh that key's recency before eviction by
"touching" it: either remove and re-insert the parent_request_id key or call
OrderedDict.move_to_end(parent_request_id) if completed_request_stats is an
OrderedDict, then assign parent_entries[parent_index] = {**entry, "index":
parent_index}; keep the existing while loop that trims entries to
self._MAX_COMPLETED_STATS_CACHE so the FIFO eviction still runs on truly stale
keys. This ensures updates to parent aggregates prevent premature eviction.
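The touch-on-write pattern relies on Python dicts preserving insertion order: popping and re-inserting a key moves it to the back, so the FIFO trim only evicts genuinely stale entries. A self-contained sketch (the cache size and key names are illustrative):

```python
MAX_COMPLETED_STATS_CACHE = 3
completed_request_stats: dict[str, dict] = {}

def store_stats(request_id: str, index: int, entry: dict) -> None:
    # pop + re-insert moves the key to the most-recent position
    entries = completed_request_stats.pop(request_id, {})
    entries[index] = entry
    completed_request_stats[request_id] = entries
    # FIFO trim now only drops keys that were not recently written
    while len(completed_request_stats) > MAX_COMPLETED_STATS_CACHE:
        oldest = next(iter(completed_request_stats))
        del completed_request_stats[oldest]

store_stats("parent", 0, {"tokens": 10})
store_stats("a", 0, {})
store_stats("b", 0, {})
store_stats("parent", 1, {"tokens": 7})  # touch: parent moves to the back
store_stats("c", 0, {})                  # evicts "a", not "parent"
print(sorted(completed_request_stats))   # ['b', 'c', 'parent']
```

An `OrderedDict` with `move_to_end()` would express the same intent; plain dicts suffice because insertion order is guaranteed since Python 3.7.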
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2e29ece8e4
```rust
// Mark both streams as completed now that both succeeded
prefill.mark_completed();
decode_stream.mark_completed();
decode_stream.spawn_stats_emission();
```
Fetch request stats from both PD workers
Checked the SGLang PD path in request_execution.rs::execute_dual_dispatch(): it launches separate prefill and decode RPCs, but this block only calls spawn_stats_emission() on decode_stream. For any non-streaming disaggregated request, the emitted request_stats event will therefore omit the prefill worker's latency/cache work and systematically under-report prompt-side cost whenever PD mode is enabled.
```python
if is_finished:
    meta_info.update(self._build_finished_meta(state, batch_out, i))
```
Stamp finished_time before building request stats
_build_finished_meta() reads state.time_stats.finished_time via _populate_timestamps(), but here it runs before the completion branch later calls state.time_stats.set_finished_time(). On normal completions that leaves request_finished_timestamp_s unset in every stored RequestStats, so the new observability event cannot report when the engine actually finished the request.
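A minimal sketch of the ordering fix, with the finish time stamped before the finished meta is built (names follow the comment; the surrounding classes and the `build_finished_meta` callback are hypothetical):

```python
import time


class TimeStats:
    def __init__(self):
        self.finished_time = 0.0

    def set_finished_time(self):
        self.finished_time = time.monotonic()


def handle_batch_item(state, meta_info, is_finished, build_finished_meta):
    if is_finished:
        # Stamp the finish time FIRST, so that building the finished
        # meta can read a populated finished_time.
        state.time_stats.set_finished_time()
        meta_info.update(build_finished_meta(state))
```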
Signed-off-by: Scott Lee <scott@together.ai>
Signed-off-by: Scott Lee <scott@together.ai>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@grpc_servicer/smg_grpc_servicer/sglang/request_manager.py`:
- Around line 615-631: The code indexes batch_out.finished_reasons without a
None check while other per-item arrays are guarded, risking a TypeError; update
the access in the is_finished assignment and the meta_info "finish_reason"
population to mirror the defensive pattern used for
prompt_tokens/completion_tokens/cached_tokens: first test if
batch_out.finished_reasons is truthy (or not None) and only then index [i],
otherwise default is_finished to False and finish_reason to None, and keep
updating state.last_* accordingly (refer to batch_out, finished_reasons,
is_finished, meta_info, and
state.last_prompt_tokens/last_completion_tokens/last_cached_tokens).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 5270fa41-f726-4c68-91c2-19270e88a96c
📒 Files selected for processing (1)
grpc_servicer/smg_grpc_servicer/sglang/request_manager.py
```python
prompt_tokens = batch_out.prompt_tokens[i] if batch_out.prompt_tokens else 0
completion_tokens = batch_out.completion_tokens[i] if batch_out.completion_tokens else 0
cached_tokens = batch_out.cached_tokens[i] if batch_out.cached_tokens else 0
is_finished = batch_out.finished_reasons[i] is not None

state.last_prompt_tokens = prompt_tokens
state.last_completion_tokens = completion_tokens
state.last_cached_tokens = cached_tokens

meta_info: dict[str, Any] = {
    "prompt_tokens": prompt_tokens,
    "completion_tokens": completion_tokens,
    "cached_tokens": cached_tokens,
    "finish_reason": (
        batch_out.finished_reasons[i] if batch_out.finished_reasons[i] else None
    ),
}
```
Inconsistent defensive access for finished_reasons.
Lines 615-617 defensively check prompt_tokens, completion_tokens, and cached_tokens for None before indexing, but finished_reasons is accessed directly at lines 618 and 628-630 without a similar guard. If finished_reasons could ever be None (as the pattern suggests for other arrays), this would raise TypeError.
🛡️ Proposed fix for consistency

```diff
-is_finished = batch_out.finished_reasons[i] is not None
+is_finished = (
+    batch_out.finished_reasons is not None
+    and batch_out.finished_reasons[i] is not None
+)

 state.last_prompt_tokens = prompt_tokens
 state.last_completion_tokens = completion_tokens
 state.last_cached_tokens = cached_tokens

 meta_info: dict[str, Any] = {
     "prompt_tokens": prompt_tokens,
     "completion_tokens": completion_tokens,
     "cached_tokens": cached_tokens,
     "finish_reason": (
-        batch_out.finished_reasons[i] if batch_out.finished_reasons[i] else None
+        batch_out.finished_reasons[i]
+        if batch_out.finished_reasons and batch_out.finished_reasons[i]
+        else None
     ),
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@grpc_servicer/smg_grpc_servicer/sglang/request_manager.py` around lines 615 -
631, The code indexes batch_out.finished_reasons without a None check while
other per-item arrays are guarded, risking a TypeError; update the access in the
is_finished assignment and the meta_info "finish_reason" population to mirror
the defensive pattern used for prompt_tokens/completion_tokens/cached_tokens:
first test if batch_out.finished_reasons is truthy (or not None) and only then
index [i], otherwise default is_finished to False and finish_reason to None, and
keep updating state.last_* accordingly (refer to batch_out, finished_reasons,
is_finished, meta_info, and
state.last_prompt_tokens/last_completion_tokens/last_cached_tokens).
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 37b196ee91
```python
if state.time_stats.response_sent_to_client_time == 0.0:
    state.time_stats.set_response_sent_to_client_time()
```
Stamp streaming response_sent_timestamp_s after the final chunk
For any SGLang streaming request with more than one chunk, this records response_sent_to_client_time on the first chunk and never updates it because of the == 0.0 guard. The new response_sent_timestamp_s field therefore collapses to a TTFT-like timestamp instead of “response finished streaming”, so dashboards derived from these stats will systematically under-report end-to-end send latency on successful streams.
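One way to get the suggested behavior is to refresh the timestamp on every chunk instead of guarding with `== 0.0`, so the stored value ends up being the final send time. A minimal sketch, with field and method names following the finding and everything else hypothetical (a separate first-chunk field is assumed for TTFT):

```python
import time


class TimeStats:
    def __init__(self):
        self.response_sent_to_client_time = 0.0
        self.first_chunk_sent_time = 0.0

    def set_response_sent_to_client_time(self):
        self.response_sent_to_client_time = time.monotonic()


def on_chunk_sent(time_stats):
    # Record TTFT separately on the first chunk...
    if time_stats.first_chunk_sent_time == 0.0:
        time_stats.first_chunk_sent_time = time.monotonic()
    # ...but refresh the response-sent time on EVERY chunk, so after the
    # final chunk it means "response finished streaming", not TTFT.
    time_stats.set_response_sent_to_client_time()
```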
```rust
mut prefill_stream: StatsProtoStream,
decode_stream: StatsProtoStream,
```
Emit request stats for the prefill half of PD streams
Checked the regular PD streaming helpers in model_gateway/src/routers/grpc/regular/streaming.rs: after the decode side succeeds they only call prefill_stream.mark_completed(), but never prefill_stream.spawn_stats_emission(). With --enable-request-statistics, disaggregated streaming chat requests therefore log only decode-worker stats, so prompt-side latency/cache work from the prefill worker disappears; the same pattern is repeated in process_generate_streaming_dual() for streaming /generate traffic.
Description
Problem
Currently, smg does not log or record performance statistics for individual requests (timings, token counts, cache hit rate, etc.). Having these results available at the request level is useful for performance analysis and debugging, as well as for production monitoring.
Solution
Add opt-in logging of per-request statistics, gated behind the `--enable-request-statistics` flag.

Example of request statistics:
Changes
- Add a `GetRequestStats` RPC and `RequestStats` message to the SGLang scheduler proto, which carries request statistics (timestamps, token counts, cache hit rate, spec decoding acceptance rate).
- Add a `UnifiedRequestStats` struct which represents engine-neutral statistics for each request; this unifies the request statistics from different engines into a single common format, making it easy for downstream users to parse (e.g. ingesting into a data pipeline, analytics platform, etc.).
- Add a `StatsProtoStream` struct which wraps the existing `ProtoStream`, used for additionally tracking the request metadata used for stats fetching. This is passed to `RequestExecutionStage` exactly as before, with optional stats fetching metadata added.
- Add the `--enable-request-statistics` flag.

The flow of how request stats are fetched:

1. `StatsProtoStream::spawn_stats_emission()` is called in `StreamingProcessor` (streaming case) or `collect_responses()` (non-streaming case). If `enable_request_statistics` is enabled, it calls `ProtoStream::spawn_stats_emission()`, which matches on the engine type; otherwise, stats collection is skipped.
2. A `tokio::spawn` task calls `SglangSchedulerClient::get_request_stats()` via the `GetRequestStats` RPC.
3. The task fetches `RequestStats` per sample in the request, merges them together, and produces the standardized `UnifiedRequestStats`.
4. A `RequestStatsEvent` is logged with the resulting `UnifiedRequestStats`.

Test Plan
Tested locally with `meta-llama/Llama-3.1-8B-Instruct` and SGLang. Sample request with `n=3` samples:

Resulting logged request statistics:
Checklist

- `cargo +nightly fmt` passes
- `cargo clippy --all-targets --all-features -- -D warnings` passes

Summary by CodeRabbit