feat(observability): add logging request level statistics from engine (#757)
Conversation
Signed-off-by: Scott Lee <scott@together.ai>
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a significant observability enhancement by enabling the collection and logging of detailed performance statistics for individual inference requests. It establishes an engine-agnostic framework, with initial support for SGLang, to capture metrics such as request timings, token counts, and cache hit rates. This data is useful for performance analysis, debugging, and production monitoring. The feature is opt-in, controlled by a new command-line flag, so it only affects users who explicitly enable it.
Code Review
This pull request introduces a valuable observability feature for logging request-level statistics from the inference engine. The implementation is well-designed, using a feature flag for easy toggling, a unified data structure for engine-agnostic stats, and a non-blocking approach for fetching the data. The code is clean, well-tested, and the changes are consistently applied across the codebase. I have one suggestion to make the code even more idiomatic.
Hi @scottjlee, the DCO sign-off check has failed. All commits must include a `Signed-off-by` line. To fix existing commits:

```shell
# Sign off the last N commits (replace N with the number of unsigned commits)
git rebase HEAD~N --signoff
git push --force-with-lease
```

To sign off future commits automatically:
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5135d1295f
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
```python
self._store_request_stats(
    request_id,
    final_response,
```
Use the yielded streaming response when persisting stats
final_response is only assigned inside the if not is_stream branch above, so on every stream=true request this call raises UnboundLocalError as soon as the finishing chunk arrives. Because _store_request_stats(...) is unconditional, the regression is not gated by --enable-request-statistics: successful SGLang streams can now terminate with an INTERNAL error at the end of the response instead of closing cleanly.
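The failure mode is reproducible in isolation; a minimal sketch of the conditional-assignment bug (the names `is_stream` and `final_response` mirror the review comment, everything else is illustrative):

```python
def finish_request(is_stream: bool) -> dict:
    if not is_stream:
        # Only the non-streaming branch binds the name.
        final_response = {"status": "ok"}
    # Unconditional use: raises UnboundLocalError on the streaming path.
    return final_response

print(finish_request(False))  # {'status': 'ok'}
try:
    finish_request(True)
except UnboundLocalError:
    print("streaming path crashed at the finishing chunk")
```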
```rust
if let Some(rate) = other.spec_decoding_acceptance_rate {
    self.spec_decoding_acceptance_rate.get_or_insert(rate);
}
```
Aggregate per-sample acceptance rates before logging
ProtoStream::spawn_stats_emission() reduces one RequestStats per sample into a single UnifiedRequestStats, but these lines keep the first sample's spec_decoding_acceptance_rate instead of combining the values. For n>1 requests with speculative decoding enabled, different samples can have very different acceptance rates, so the emitted request_stats event reports an arbitrary sample-0 number while completion_tokens has already been summed across all samples.
Wasn't sure what the most appropriate way would be to aggregate the spec acceptance rate across multiple samples, so I just kept the first value. Maybe max is most appropriate?
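One deterministic alternative is a completion_tokens-weighted average, which keeps the merged rate independent of sample order. This sketch is illustrative only; the function name and fallback choices are hypothetical, and the real merge lives in Rust in `model_gateway/src/observability/events.rs`:

```python
def merge_acceptance_rate(self_rate, self_tokens, other_rate, other_tokens):
    """Combine per-sample spec-decoding acceptance rates, weighted by completion tokens."""
    if self_rate is None:
        return other_rate
    if other_rate is None:
        return self_rate
    total = self_tokens + other_tokens
    if total == 0:
        # Deterministic fallback when neither sample produced tokens.
        return (self_rate + other_rate) / 2
    return (self_rate * self_tokens + other_rate * other_tokens) / total

# Two samples with very different acceptance rates:
print(merge_acceptance_rate(0.5, 100, 0.25, 100))  # 0.375
```

Unlike max or first-wins, the weighted average stays consistent with `completion_tokens` already being summed across samples.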
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e5db99ddb4
```diff
 pub async fn process_messages_streaming_chunks(
     &self,
-    mut grpc_stream: ProtoStream,
+    mut grpc_stream: StatsProtoStream,
```
Emit request stats from Messages API streams
After switching this handler to StatsProtoStream, it still only calls mark_completed() before returning; unlike the other streaming handlers, it never calls spawn_stats_emission(). With --enable-request-statistics, every successful streaming /v1/messages request therefore finishes without a request_stats event, so the new observability feature silently does nothing for Anthropic-style streaming traffic.
```diff
 /// Process decode stream for tool call events.
 async fn process_decode_stream(
-    mut decode_stream: ProtoStream,
+    mut decode_stream: StatsProtoStream,
```
Emit request stats from Harmony Responses streams
This decode path now accepts StatsProtoStream, but after the stream is marked completed it returns without ever invoking spawn_stats_emission(). When --enable-request-statistics is on, successful streaming /v1/responses requests on the Harmony backend—including tool-call iterations—will never log the new request-level stats event.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
model_gateway/src/routers/grpc/regular/streaming.rs (2)
1517-1525: ⚠️ Potential issue | 🟠 Major: Messages streaming never uses the stats wrapper.
This helper now accepts `StatsProtoStream`, but it returns after `grpc_stream.mark_completed()` on line 2094 without ever calling `grpc_stream.spawn_stats_emission()`. `--enable-request-statistics` is therefore a no-op for the Messages streaming endpoint, unlike the chat and generate paths in this file.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@model_gateway/src/routers/grpc/regular/streaming.rs` around lines 1517 - 1525, The process_messages_streaming_chunks function accepts a StatsProtoStream but never starts stats emission: before calling grpc_stream.mark_completed() (and before returning) invoke grpc_stream.spawn_stats_emission() so the request statistics are emitted for the Messages streaming endpoint; ensure spawn_stats_emission is called (and awaited or left to run as its API requires) prior to or immediately after mark_completed in process_messages_streaming_chunks so --enable-request-statistics is not a no-op.
179-185: ⚠️ Potential issue | 🟠 Major: Finalize `StatsProtoStream` before the trailer SSE writes.
After the read loop is exhausted, phases 3-5 still do several fallible `tx.send(...)` calls. If the client disconnects there, control returns before line 547 and line 563, so a fully-consumed backend stream is still dropped as aborted and never schedules request-stats emission. Please move `mark_completed()`/`spawn_stats_emission()` into the post-read path before those trailer writes.
Also applies to: 563-563
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@model_gateway/src/routers/grpc/regular/streaming.rs` around lines 179 - 185, The finalization of StatsProtoStream must occur immediately after the read loop completes and before any fallible tx.send(...) trailer writes: call grpc_stream.mark_completed() and then context::spawn_stats_emission(grpc_stream, dispatch, original_request) (or the existing spawn_stats_emission helper you use) in the post-read path, before performing the subsequent tx.send(...) trailer SSE writes, so that a fully-consumed backend stream is finalized and stats emission is scheduled even if a client disconnects during the trailer sends; update the code paths that currently call mark_completed()/spawn_stats_emission() after the tx.send(...) calls (including the branch covering the second trailer) to instead finalize and spawn emission first, then proceed with tx.send(...) and handle send errors as before.

model_gateway/src/routers/grpc/harmony/streaming.rs (1)
527-533: ⚠️ Potential issue | 🟠 Major: Emit request stats from the Harmony Responses helper.
`process_decode_stream()` now accepts `StatsProtoStream`, but it still stops at `mark_completed()` on line 1010. Both single- and dual-stream Responses requests funnel through this helper, so `--enable-request-statistics` never emits request-level stats for Harmony Responses streaming.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@model_gateway/src/routers/grpc/harmony/streaming.rs` around lines 527 - 533, process_decode_stream currently takes a StatsProtoStream but never emits request-level stats because it stops at the existing mark_completed() call (see mark_completed reference in this function); update the end-of-stream/cleanup logic in process_decode_stream so that when the incoming StatsProtoStream contains request statistics you call the request-level stats emitter/mark_completed path (i.e., invoke the same request-statistics completion that the dual-stream Responses code uses) before returning, ensuring request stats are emitted for both single- and dual-stream Harmony Responses; look for process_decode_stream, StatsProtoStream and mark_completed to add the conditional completion call and emit the request-level stats.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@model_gateway/src/routers/grpc/pipeline.rs`:
- Around line 129-132: The pipeline tags RequestExecutionStage with
BACKEND_HARMONY but new_harmony() still sets backend_type to BACKEND_REGULAR;
update new_harmony() to set backend_type to BACKEND_HARMONY so the backend label
is consistent. Locate the new_harmony() constructor (and any Pipeline or Router
builder that initializes backend_type) and replace the BACKEND_REGULAR
assignment with BACKEND_HARMONY so RequestExecutionStage::new(...,
metrics_labels::BACKEND_HARMONY) and backend_type use the same symbol.
📒 Files selected for processing (17)
- crates/grpc_client/proto/sglang_scheduler.proto
- crates/grpc_client/src/sglang_scheduler.rs
- grpc_servicer/smg_grpc_servicer/sglang/request_manager.py
- grpc_servicer/smg_grpc_servicer/sglang/servicer.py
- model_gateway/src/config/builder.rs
- model_gateway/src/config/types.rs
- model_gateway/src/main.rs
- model_gateway/src/observability/events.rs
- model_gateway/src/routers/grpc/common/response_collection.rs
- model_gateway/src/routers/grpc/common/stages/request_execution.rs
- model_gateway/src/routers/grpc/context.rs
- model_gateway/src/routers/grpc/harmony/streaming.rs
- model_gateway/src/routers/grpc/pd_router.rs
- model_gateway/src/routers/grpc/pipeline.rs
- model_gateway/src/routers/grpc/proto_wrapper.rs
- model_gateway/src/routers/grpc/regular/streaming.rs
- model_gateway/src/routers/grpc/router.rs
```python
async def GetRequestStats(
    self,
    request: sglang_scheduler_pb2.GetRequestStatsRequest,
    context: grpc.aio.ServicerContext,
) -> sglang_scheduler_pb2.GetRequestStatsResponse:
    """Get request-level statistics for a completed request."""
    try:
        stored = self.request_manager.get_request_stats(request.request_id)
        stats = [self._build_request_stats(entry) for entry in stored]
        return sglang_scheduler_pb2.GetRequestStatsResponse(stats=stats)
    except Exception as e:
        logger.error(f"GetRequestStats failed: {e}\n{get_exception_traceback()}")
        context.set_code(grpc.StatusCode.INTERNAL)
        context.set_details(f"Failed to get request stats: {e}")
        return sglang_scheduler_pb2.GetRequestStatsResponse()
```
🧩 Analysis chain
🏁 Script executed:
```shell
#!/bin/bash
# Verify INTERNAL error handling convention usage in servicer implementations.
rg -n --type=py 'context\.abort\(grpc\.StatusCode\.INTERNAL,\s*str\(e\)\)' grpc_servicer/smg_grpc_servicer -C2
rg -n --type=py 'context\.set_code\(grpc\.StatusCode\.INTERNAL\)|context\.set_details\(' grpc_servicer/smg_grpc_servicer/sglang/servicer.py -C2
```

Repository: lightseekorg/smg
Length of output: 2568
Use await context.abort(grpc.StatusCode.INTERNAL, str(e)) instead of set_code/set_details + empty response.
Lines 574–575 diverge from the servicer-wide INTERNAL error handling convention (observed in Embed, Generate, and HealthCheck elsewhere in this file). Use the established pattern of logging server-side and calling context.abort() with the exception string.
🔧 Proposed fix

```diff
 except Exception as e:
     logger.error(f"GetRequestStats failed: {e}\n{get_exception_traceback()}")
-    context.set_code(grpc.StatusCode.INTERNAL)
-    context.set_details(f"Failed to get request stats: {e}")
-    return sglang_scheduler_pb2.GetRequestStatsResponse()
+    await context.abort(grpc.StatusCode.INTERNAL, str(e))
```

Note: GetLoads (lines 548–549) has the same issue and should be fixed similarly.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@grpc_servicer/smg_grpc_servicer/sglang/servicer.py` around lines 562 - 577,
Replace the current error-handling in GetRequestStats to follow the
servicer-wide convention: after logging the exception (using logger.error with
get_exception_traceback()), call await context.abort(grpc.StatusCode.INTERNAL,
str(e)) instead of context.set_code/context.set_details and returning an empty
response; locate the GetRequestStats method and change its except block
accordingly (references: GetRequestStats, request_manager.get_request_stats,
_build_request_stats, context.abort). Also apply the same fix to GetLoads so
both methods use await context.abort(...) on exceptions.
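The behavioral difference matters: `context.abort()` raises, so the handler never returns a misleading empty payload, while `set_code`/`set_details` still hands the client an OK-shaped empty message. A dependency-free sketch of the two shapes, using a stub context instead of real grpc (all names here are illustrative, not the grpc API):

```python
import asyncio

class RpcAborted(Exception):
    """Stands in for the exception grpc raises internally on context.abort()."""

class StubContext:
    def __init__(self):
        self.code = None
        self.details = None

    def set_code(self, code):
        self.code = code

    def set_details(self, details):
        self.details = details

    async def abort(self, code, details):
        self.code, self.details = code, details
        raise RpcAborted(details)  # control never returns to the handler

async def handler_with_abort(context):
    try:
        raise RuntimeError("stats lookup failed")
    except Exception as e:
        await context.abort("INTERNAL", str(e))
    return {}  # unreachable: abort raised

async def handler_with_set_code(context):
    try:
        raise RuntimeError("stats lookup failed")
    except Exception as e:
        context.set_code("INTERNAL")
        context.set_details(str(e))
        return {}  # client still receives an empty, OK-shaped message body

# Demo: abort surfaces as a raised exception, never a bogus empty reply.
ctx = StubContext()
try:
    asyncio.run(handler_with_abort(ctx))
except RpcAborted as e:
    print("aborted:", e)  # aborted: stats lookup failed
```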
```rust
if let Some(rate) = other.cache_hit_rate {
    self.cache_hit_rate.get_or_insert(rate);
}
if let Some(rate) = other.spec_decoding_acceptance_rate {
    self.spec_decoding_acceptance_rate.get_or_insert(rate);
}
```
Don’t keep the first sample’s spec-decoding rate.
For n > 1, spec_decoding_acceptance_rate is per-sample. get_or_insert makes the merged request-level stat depend on whichever sample happens to be reduced first, even though completion_tokens are already being aggregated across samples. Please merge this deterministically instead, e.g. with a completion_tokens-weighted average.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@model_gateway/src/observability/events.rs` around lines 147 - 152, The
current merge uses get_or_insert for spec_decoding_acceptance_rate, which
preserves the first sample's per-sample rate; instead compute a deterministic
completion_tokens-weighted average using self.completion_tokens and
other.completion_tokens and set self.spec_decoding_acceptance_rate =
Some(weighted_rate). If one side is None, fall back to the existing Some; if
both are Some but total_tokens == 0, pick a deterministic choice (e.g., average
the two or keep self). Update the merge logic in the method that handles merging
events (the code around spec_decoding_acceptance_rate and completion_tokens) to
compute and assign the weighted value rather than calling get_or_insert.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e7d3dbd41a
```python
self._handle_single_request(
    gen_obj,
    gen_request_id,
    grpc_context,
    parent_request_id=base_request_id,
```
Include the prefix-caching phase in n>1 request stats
For any SGLang request with n > 1, generate_request() still performs the separate base_request_id-prefix prefill round-trip just above, but only the phase-2 children are copied onto the parent request here via parent_request_id=base_request_id. That means GetRequestStats(base_request_id) omits the entire prefix-caching phase, so the new request_stats event under-reports latency and prompt-side work for every multi-sample request.
```python
if time_stats.response_sent_to_client_time > 0.0:
    meta_info.setdefault(
        "response_sent_timestamp_s",
        time_stats.get_response_sent_to_client_realtime(),
```
Record response_sent timestamp before exporting stats
_populate_timestamps() now advertises response_sent_timestamp_s, but this gRPC request manager never records response_sent_to_client_time before _store_request_stats() persists the final meta. As a result, every emitted request_stats event will leave response_sent_timestamp_s unset, so any dashboard or experiment that needs server-to-client send latency cannot use the new field.
```python
state.time_stats.set_finished_time()

meta: dict[str, Any] = state.time_stats.convert_to_output_meta_info(
```
Preserve the scheduler finish time when building meta
_handle_batch_output() already stamps finished_time as soon as the scheduler marks the request complete, but this new call overwrites it later when the final item is dequeued for gRPC. On slow consumers or a backed-up out_queue, request_finished_timestamp_s and the derived latency fields now include post-engine delivery delay instead of actual engine completion time, which skews the request_stats event.
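One way to avoid the overwrite (a sketch, not the project's actual `TimeStats` API) is to make the finish stamp idempotent, so the first, scheduler-side call wins:

```python
import time

class TimeStats:
    """Minimal illustration: record the engine finish time at most once."""

    def __init__(self):
        self.finished_time = 0.0

    def set_finished_time(self):
        # Only the first call (when the scheduler marks the request complete)
        # takes effect; a later call from the gRPC dequeue path on a slow
        # consumer no longer overwrites it.
        if self.finished_time == 0.0:
            self.finished_time = time.monotonic()

stats = TimeStats()
stats.set_finished_time()
first = stats.finished_time
stats.set_finished_time()  # simulated late call when the final item is dequeued
assert stats.finished_time == first
```

With this guard, `request_finished_timestamp_s` reflects engine completion rather than delivery delay.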
Actionable comments posted: 2
📒 Files selected for processing (1)
grpc_servicer/smg_grpc_servicer/sglang/request_manager.py
```python
if not (
    len(batch_out.spec_verify_ct) > index
    and batch_out.spec_verify_ct[index] > 0
    and len(batch_out.spec_accepted_tokens) > index
):
```
Guard speculative arrays before len/index access.
Line 877 assumes batch_out.spec_verify_ct and batch_out.spec_accepted_tokens are always present and index-aligned. If either is None or shorter than index, this can raise and skip completion-path stats emission.
Proposed defensive fix

```diff
 def _populate_spec_decoding_metrics(
     self,
     meta_info: dict[str, Any],
     batch_out: BatchTokenIDOutput,
     index: int,
 ) -> None:
     """Populate speculative decoding stats when available."""
-    if not (
-        len(batch_out.spec_verify_ct) > index
-        and batch_out.spec_verify_ct[index] > 0
-        and len(batch_out.spec_accepted_tokens) > index
-    ):
+    spec_verify_ct = batch_out.spec_verify_ct or []
+    spec_accepted_tokens = batch_out.spec_accepted_tokens or []
+    if not (
+        index < len(spec_verify_ct)
+        and spec_verify_ct[index] > 0
+        and index < len(spec_accepted_tokens)
+    ):
         return
-    verify_ct = batch_out.spec_verify_ct[index]
-    accepted_tokens = batch_out.spec_accepted_tokens[index]
+    verify_ct = spec_verify_ct[index]
+    accepted_tokens = spec_accepted_tokens[index]
```

Also applies to: 883-885
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@grpc_servicer/smg_grpc_servicer/sglang/request_manager.py` around lines 876 -
880, The current checks assume batch_out.spec_verify_ct and
batch_out.spec_accepted_tokens are non-None and index-aligned, which can raise
when they are None or shorter; update the conditional around the completion-path
stats emission to first guard each array for truthiness/None and length (e.g.,
ensure batch_out.spec_verify_ct is not None and index <
len(batch_out.spec_verify_ct) and batch_out.spec_verify_ct[index] > 0, and
similarly ensure batch_out.spec_accepted_tokens is not None and index <
len(batch_out.spec_accepted_tokens)) before any len() or index access; apply the
same defensive checks to the later block that uses these arrays (the code around
lines referenced 883-885) so completion-path stats emission won’t throw when
arrays are missing or misaligned.
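The `or []` guard can be exercised in isolation; a small sketch of just the availability check (the function name is illustrative, not the actual `BatchTokenIDOutput` API):

```python
def spec_metrics_available(spec_verify_ct, spec_accepted_tokens, index):
    """True only when both arrays are present, long enough, and non-zero at index."""
    spec_verify_ct = spec_verify_ct or []
    spec_accepted_tokens = spec_accepted_tokens or []
    return (
        index < len(spec_verify_ct)
        and spec_verify_ct[index] > 0
        and index < len(spec_accepted_tokens)
    )

# None, empty, and misaligned arrays are all handled without raising:
print(spec_metrics_available(None, None, 0))    # False
print(spec_metrics_available([3], [12], 0))     # True
print(spec_metrics_available([3, 2], [12], 1))  # False (accepted array too short)
```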
```python
if parent_request_id is not None:
    parent_entries = self.completed_request_stats.setdefault(parent_request_id, {})
    parent_entries[parent_index] = {**entry, "index": parent_index}

while len(self.completed_request_stats) > self._MAX_COMPLETED_STATS_CACHE:
    oldest = next(iter(self.completed_request_stats))
    del self.completed_request_stats[oldest]
```
Eviction policy can drop in-progress parent aggregates.
Line 942 evicts by oldest insertion order, but updates do not refresh key recency. Under high throughput, a parent aggregate key can be evicted while later child samples are still arriving, returning partial stats for the parent request.
Proposed fix to “touch” keys on write before FIFO eviction

```diff
-entries = self.completed_request_stats.setdefault(request_id, {})
+entries = self.completed_request_stats.pop(request_id, {})
 entries[index] = entry
+self.completed_request_stats[request_id] = entries
-if parent_request_id is not None:
-    parent_entries = self.completed_request_stats.setdefault(parent_request_id, {})
+if parent_request_id is not None and parent_request_id != request_id:
+    parent_entries = self.completed_request_stats.pop(parent_request_id, {})
     parent_entries[parent_index] = {**entry, "index": parent_index}
+    self.completed_request_stats[parent_request_id] = parent_entries
 while len(self.completed_request_stats) > self._MAX_COMPLETED_STATS_CACHE:
     oldest = next(iter(self.completed_request_stats))
     del self.completed_request_stats[oldest]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@grpc_servicer/smg_grpc_servicer/sglang/request_manager.py` around lines 938 -
944, The eviction currently uses insertion-order on self.completed_request_stats
so parent aggregates can be evicted even after updates; when you write a parent
aggregate (in the block that checks parent_request_id and sets parent_entries
using parent_index and entry), refresh that key's recency before eviction by
"touching" it: either remove and re-insert the parent_request_id key or call
OrderedDict.move_to_end(parent_request_id) if completed_request_stats is an
OrderedDict, then assign parent_entries[parent_index] = {**entry, "index":
parent_index}; keep the existing while loop that trims entries to
self._MAX_COMPLETED_STATS_CACHE so the FIFO eviction still runs on truly stale
keys. This ensures updates to parent aggregates prevent premature eviction.
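The touch-on-write pattern relies on Python dicts preserving insertion order: popping and re-inserting a key moves it to the back, so the FIFO trim only evicts genuinely stale entries. A self-contained sketch (the cache size and key names are illustrative):

```python
MAX_COMPLETED_STATS_CACHE = 3
completed_request_stats: dict[str, dict] = {}

def store_stats(request_id: str, index: int, entry: dict) -> None:
    # pop + re-insert moves the key to the most-recent position
    entries = completed_request_stats.pop(request_id, {})
    entries[index] = entry
    completed_request_stats[request_id] = entries
    # FIFO trim now only drops keys that were not recently written
    while len(completed_request_stats) > MAX_COMPLETED_STATS_CACHE:
        oldest = next(iter(completed_request_stats))
        del completed_request_stats[oldest]

store_stats("parent", 0, {"tokens": 10})
store_stats("a", 0, {})
store_stats("b", 0, {})
store_stats("parent", 1, {"tokens": 7})  # touch: parent moves to the back
store_stats("c", 0, {})                  # evicts "a", not "parent"
print(sorted(completed_request_stats))   # ['b', 'c', 'parent']
```

An `OrderedDict` with `move_to_end()` would express the same intent; plain dicts suffice because insertion order is guaranteed since Python 3.7.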
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2e29ece8e4
```rust
// Mark both streams as completed now that both succeeded
prefill.mark_completed();
decode_stream.mark_completed();
decode_stream.spawn_stats_emission();
```
Fetch request stats from both PD workers
Checked the SGLang PD path in request_execution.rs::execute_dual_dispatch(): it launches separate prefill and decode RPCs, but this block only calls spawn_stats_emission() on decode_stream. For any non-streaming disaggregated request, the emitted request_stats event will therefore omit the prefill worker's latency/cache work and systematically under-report prompt-side cost whenever PD mode is enabled.
```python
if is_finished:
    meta_info.update(self._build_finished_meta(state, batch_out, i))
```
Stamp finished_time before building request stats
_build_finished_meta() reads state.time_stats.finished_time via _populate_timestamps(), but here it runs before the completion branch later calls state.time_stats.set_finished_time(). On normal completions that leaves request_finished_timestamp_s unset in every stored RequestStats, so the new observability event cannot report when the engine actually finished the request.
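A minimal sketch of the ordering fix, with the finish time stamped before the finished meta is built (names follow the comment; the surrounding classes and the `build_finished_meta` callback are hypothetical):

```python
import time


class TimeStats:
    def __init__(self):
        self.finished_time = 0.0

    def set_finished_time(self):
        self.finished_time = time.monotonic()


def handle_batch_item(state, meta_info, is_finished, build_finished_meta):
    if is_finished:
        # Stamp the finish time FIRST, so that building the finished
        # meta can read a populated finished_time.
        state.time_stats.set_finished_time()
        meta_info.update(build_finished_meta(state))
```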
Signed-off-by: Scott Lee <scott@together.ai>
Signed-off-by: Scott Lee <scott@together.ai>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@grpc_servicer/smg_grpc_servicer/sglang/request_manager.py`:
- Around line 615-631: The code indexes batch_out.finished_reasons without a
None check while other per-item arrays are guarded, risking a TypeError; update
the access in the is_finished assignment and the meta_info "finish_reason"
population to mirror the defensive pattern used for
prompt_tokens/completion_tokens/cached_tokens: first test if
batch_out.finished_reasons is truthy (or not None) and only then index [i],
otherwise default is_finished to False and finish_reason to None, and keep
updating state.last_* accordingly (refer to batch_out, finished_reasons,
is_finished, meta_info, and
state.last_prompt_tokens/last_completion_tokens/last_cached_tokens).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 5270fa41-f726-4c68-91c2-19270e88a96c
📒 Files selected for processing (1)
grpc_servicer/smg_grpc_servicer/sglang/request_manager.py
```python
prompt_tokens = batch_out.prompt_tokens[i] if batch_out.prompt_tokens else 0
completion_tokens = batch_out.completion_tokens[i] if batch_out.completion_tokens else 0
cached_tokens = batch_out.cached_tokens[i] if batch_out.cached_tokens else 0
is_finished = batch_out.finished_reasons[i] is not None

state.last_prompt_tokens = prompt_tokens
state.last_completion_tokens = completion_tokens
state.last_cached_tokens = cached_tokens

meta_info: dict[str, Any] = {
    "prompt_tokens": prompt_tokens,
    "completion_tokens": completion_tokens,
    "cached_tokens": cached_tokens,
    "finish_reason": (
        batch_out.finished_reasons[i] if batch_out.finished_reasons[i] else None
    ),
}
```
Inconsistent defensive access for finished_reasons.
Lines 615-617 defensively check prompt_tokens, completion_tokens, and cached_tokens for None before indexing, but finished_reasons is accessed directly at lines 618 and 628-630 without a similar guard. If finished_reasons could ever be None (as the pattern suggests for other arrays), this would raise TypeError.
🛡️ Proposed fix for consistency

```diff
-is_finished = batch_out.finished_reasons[i] is not None
+is_finished = (
+    batch_out.finished_reasons is not None
+    and batch_out.finished_reasons[i] is not None
+)

 state.last_prompt_tokens = prompt_tokens
 state.last_completion_tokens = completion_tokens
 state.last_cached_tokens = cached_tokens

 meta_info: dict[str, Any] = {
     "prompt_tokens": prompt_tokens,
     "completion_tokens": completion_tokens,
     "cached_tokens": cached_tokens,
     "finish_reason": (
-        batch_out.finished_reasons[i] if batch_out.finished_reasons[i] else None
+        batch_out.finished_reasons[i]
+        if batch_out.finished_reasons and batch_out.finished_reasons[i]
+        else None
     ),
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@grpc_servicer/smg_grpc_servicer/sglang/request_manager.py` around lines 615 -
631, The code indexes batch_out.finished_reasons without a None check while
other per-item arrays are guarded, risking a TypeError; update the access in the
is_finished assignment and the meta_info "finish_reason" population to mirror
the defensive pattern used for prompt_tokens/completion_tokens/cached_tokens:
first test if batch_out.finished_reasons is truthy (or not None) and only then
index [i], otherwise default is_finished to False and finish_reason to None, and
keep updating state.last_* accordingly (refer to batch_out, finished_reasons,
is_finished, meta_info, and
state.last_prompt_tokens/last_completion_tokens/last_cached_tokens).
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 37b196ee91
```python
if state.time_stats.response_sent_to_client_time == 0.0:
    state.time_stats.set_response_sent_to_client_time()
```
Stamp streaming response_sent_timestamp_s after the final chunk
For any SGLang streaming request with more than one chunk, this records response_sent_to_client_time on the first chunk and never updates it because of the == 0.0 guard. The new response_sent_timestamp_s field therefore collapses to a TTFT-like timestamp instead of “response finished streaming”, so dashboards derived from these stats will systematically under-report end-to-end send latency on successful streams.
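One way to get the suggested behavior is to refresh the timestamp on every chunk instead of guarding with `== 0.0`, so the stored value ends up being the final send time. A minimal sketch, with field and method names following the finding and everything else hypothetical (a separate first-chunk field is assumed for TTFT):

```python
import time


class TimeStats:
    def __init__(self):
        self.response_sent_to_client_time = 0.0
        self.first_chunk_sent_time = 0.0

    def set_response_sent_to_client_time(self):
        self.response_sent_to_client_time = time.monotonic()


def on_chunk_sent(time_stats):
    # Record TTFT separately on the first chunk...
    if time_stats.first_chunk_sent_time == 0.0:
        time_stats.first_chunk_sent_time = time.monotonic()
    # ...but refresh the response-sent time on EVERY chunk, so after the
    # final chunk it means "response finished streaming", not TTFT.
    time_stats.set_response_sent_to_client_time()
```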
```rust
mut prefill_stream: StatsProtoStream,
decode_stream: StatsProtoStream,
```
Emit request stats for the prefill half of PD streams
Checked the regular PD streaming helpers in model_gateway/src/routers/grpc/regular/streaming.rs: after the decode side succeeds they only call prefill_stream.mark_completed(), but never prefill_stream.spawn_stats_emission(). With --enable-request-statistics, disaggregated streaming chat requests therefore log only decode-worker stats, so prompt-side latency/cache work from the prefill worker disappears; the same pattern is repeated in process_generate_streaming_dual() for streaming /generate traffic.
Description
Problem
Currently, smg does not log or record performance statistics for individual requests (timings, token counts, cache hit rate, etc.). Having these results available at the request level is useful for performance analysis and debugging, as well as for production monitoring.
Solution
Add opt-in logging of per-request statistics, gated behind the `--enable-request-statistics` flag.

Example of request statistics:
Changes
- Add a `GetRequestStats` RPC and `RequestStats` message to the SGLang scheduler proto, which carries request statistics (timestamps, token counts, cache hit rate, spec decoding acceptance rate).
- Add a `UnifiedRequestStats` struct which represents engine-neutral statistics for each request; this unifies the request statistics from different engines into a single common format, making it easy for downstream users to parse (e.g. ingesting into a data pipeline, analytics platform, etc.).
- Add a `StatsProtoStream` struct which wraps the existing `ProtoStream`, used for additionally tracking the request metadata used for stats fetching. This is passed to `RequestExecutionStage` exactly as before, with optional stats fetching metadata added.
- Add the `--enable-request-statistics` flag.

The flow of how request stats are fetched:

1. `StatsProtoStream::spawn_stats_emission()` is called in `StreamingProcessor` (streaming case) or `collect_responses()` (non-streaming case). If `enable_request_statistics` is enabled, it calls `ProtoStream::spawn_stats_emission()`, which matches on the engine type; otherwise, stats collection is skipped.
2. A `tokio::spawn` task calls `SglangSchedulerClient::get_request_stats()` via the `GetRequestStats` RPC.
3. The task fetches `RequestStats` per sample in the request, merges them together, and produces the standardized `UnifiedRequestStats`.
4. A `RequestStatsEvent` is logged with the resulting `UnifiedRequestStats`.

Test Plan
Tested locally with `meta-llama/Llama-3.1-8B-Instruct` and SGLang. Sample request with `n=3` samples:

Resulting logged request statistics:
Checklist

- `cargo +nightly fmt` passes
- `cargo clippy --all-targets --all-features -- -D warnings` passes

Summary by CodeRabbit