
Add logging request level statistics from engine #693

Closed
scottjlee wants to merge 30 commits into lightseekorg:main from scottjlee:sjl/0302-unify-request-stats

Conversation


@scottjlee scottjlee commented Mar 10, 2026

! WIP - Not ready for review !

Description

WIP: add logging of request-level statistics from the engine. Creates an abstraction for general engines, with a concrete SGLang implementation.

Problem

Solution

Changes

Test Plan

Checklist
  • cargo +nightly fmt passes
  • cargo clippy --all-targets --all-features -- -D warnings passes
  • (Optional) Documentation updated

@github-actions github-actions bot added grpc gRPC client and router changes model-gateway Model gateway crate changes labels Mar 10, 2026

coderabbitai bot commented Mar 10, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4a655f2a-32fd-406b-b662-5e392187eecc


@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the observability of the model gateway by introducing a robust system for collecting and logging request-level statistics from various backend engines. It establishes new gRPC definitions for these statistics, integrates them into core response messages, and provides a unified, configurable mechanism for processing and emitting these performance metrics. This improvement offers deeper insights into request performance and behavior, particularly for streaming and aborted requests, aiding in debugging and performance analysis.

Highlights

  • New gRPC Request Statistics RPC and Message: Introduced a new GetRequestStats RPC to the SglangScheduler service and defined a RequestStats message to capture detailed request-level metrics, including timestamps, cache hit rates, and token counts.
  • Integration of Request Statistics into Responses: The RequestStats message is now included as an optional field in GenerateResponse (as a new variant) and AbortResponse messages, allowing for statistics to be returned upon request completion or abortion.
  • Configurable Request Statistics Collection: Added a new configuration option, enable_request_statistics, to the model_gateway which can be controlled via CLI arguments, allowing operators to enable or disable the collection and logging of these metrics.
  • Unified Observability for Request Statistics: Implemented RequestStatsEvent and UnifiedRequestStats structures to standardize the collection, aggregation, and emission of request-level statistics across different backend engines (SGLang, vLLM, TRTLLM) for consistent logging.
  • Enhanced Stream Response Processing: Updated streaming and non-streaming response collection logic to process the new RequestStats variants from gRPC streams and aggregate them into UnifiedRequestStats when the feature is enabled. This includes specific handling for client disconnects and errors during streaming, ensuring statistics are logged with appropriate status codes.
Changelog
  • crates/grpc_client/proto/sglang_scheduler.proto
    • Added GetRequestStats RPC to SglangScheduler service.
    • Defined new RequestStats message with detailed metrics.
    • Integrated RequestStats into GenerateResponse and AbortResponse messages.
  • crates/grpc_client/src/sglang_scheduler.rs
    • Modified AbortOnDropStream::abort to return proto::AbortResponse.
    • Updated SglangSchedulerClient::abort_request to return proto::AbortResponse.
  • crates/grpc_client/src/trtllm_service.rs
    • Updated AbortOnDropStream::abort to return proto::AbortResponse.
  • crates/grpc_client/src/vllm_engine.rs
    • Updated AbortOnDropStream::abort to return proto::AbortResponse.
    • Modified VllmEngineClient::abort_request to return proto::AbortResponse.
  • model_gateway/src/config/builder.rs
    • Added enable_request_statistics builder method for RouterConfigBuilder.
  • model_gateway/src/config/types.rs
    • Added enable_request_statistics field to RouterConfig with a default of false.
  • model_gateway/src/main.rs
    • Introduced --enable-request-statistics CLI argument.
    • Integrated enable_request_statistics argument into RouterConfig construction.
  • model_gateway/src/observability/events.rs
    • Defined RequestStatsFieldMapping for backend-specific field names.
    • Created UnifiedRequestStats for normalized request-level statistics.
    • Added RequestStatsEvent for emitting unified request statistics.
    • Implemented helper functions format_optional_f64 and format_optional_u16.
  • model_gateway/src/routers/grpc/common/response_collection.rs
    • Introduced CollectedResponses struct to return both completes and request stats.
    • Modified collect_responses to accept enable_request_statistics and return CollectedResponses.
  • model_gateway/src/routers/grpc/harmony/processor.rs
    • Updated HarmonyResponseProcessor to accept enable_request_statistics.
    • Modified process_chat_completion_request and process_responses_request to use CollectedResponses.
    • Added logic to emit RequestStatsEvent when request_stats are available.
  • model_gateway/src/routers/grpc/harmony/stages/response_processing.rs
    • Updated HarmonyResponseProcessingStage::new to accept enable_request_statistics and pass it to processors.
  • model_gateway/src/routers/grpc/harmony/streaming.rs
    • Updated HarmonyStreamingProcessor to accept enable_request_statistics.
    • Modified streaming processing functions to collect ProtoRequestStats and emit RequestStatsEvent.
    • Adjusted error handling during streaming to emit RequestStatsEvent on errors.
    • Removed DEFAULT_SERVER_LABEL constant and replaced with string literal 'mcp'.
  • model_gateway/src/routers/grpc/pd_router.rs
    • Passed enable_request_statistics from RouterConfig to pipeline creation.
  • model_gateway/src/routers/grpc/pipeline.rs
    • Updated RequestPipeline constructors to accept and pass enable_request_statistics to response processors.
  • model_gateway/src/routers/grpc/proto_wrapper.rs
    • Added ProtoRequestStats enum for unified request stats payloads.
    • Integrated RequestStats into ProtoGenerateResponse::into_response.
    • Implemented EngineRequestStatsMapper trait for SGLang, vLLM, and TRTLLM.
    • Added functions collect_unified_request_stats, collect_stream_request_stats, collect_request_stats, and collect_sglang_request_stats for aggregating statistics.
    • Added http_status_code method to ProtoGenerateError.
    • Added abort_with_request_stats method to ProtoStream to return request stats on abort.
  • model_gateway/src/routers/grpc/regular/processor.rs
    • Updated ResponseProcessor to accept enable_request_statistics.
    • Modified process_chat_completion_request and process_generate_request to use CollectedResponses.
    • Added logic to emit RequestStatsEvent when request_stats are available.
  • model_gateway/src/routers/grpc/regular/streaming.rs
    • Updated StreamingProcessor and GenerateStreamContext to include enable_request_statistics.
    • Modified streaming processing functions to collect ProtoRequestStats and emit RequestStatsEvent.
    • Implemented send_sse_or_abort helper to handle client disconnects and backend aborts with stats collection.
    • Adjusted error handling during streaming to emit RequestStatsEvent on errors and client disconnects.
  • model_gateway/src/routers/grpc/utils/chat_utils.rs
    • Moved collect_stream_responses into this file.
    • Updated collect_stream_responses to return CollectedStreamResponses including UnifiedRequestStats.
Activity
  • New gRPC definitions for request-level statistics were added to sglang_scheduler.proto.
  • The RequestStats message was integrated into GenerateResponse and AbortResponse across SGLang, vLLM, and TRTLLM client implementations.
  • A new configuration option enable_request_statistics was introduced in the model gateway, exposed via CLI, to control this feature.
  • Unified data structures (UnifiedRequestStats, RequestStatsEvent) were created for consistent logging of statistics.
  • Response collection and streaming logic in both regular and Harmony routers were updated to gather and emit these new statistics, including handling for errors and client disconnects.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a new feature to collect and emit request-level statistics across different backend engines (SGLang, vLLM, TRTLLM). The changes involve updating .proto definitions to include RequestStats in GenerateResponse and AbortResponse, modifying gRPC client implementations to propagate these statistics, and adding a configuration option to enable/disable this feature. New data structures (RequestStatsFieldMapping, UnifiedRequestStats, RequestStatsEvent) are introduced in the observability module to normalize and emit these statistics. Response collection and streaming logic in the Harmony and regular gRPC routers are updated to process and log these RequestStats when enabled. Review comments suggest refactoring duplicated logging logic in RequestStatsEvent::emit for better maintainability and simplifying match statements with the ? operator in several response processing functions. Additionally, a duplicate and unused collect_stream_responses function in chat_utils.rs should be removed.

Comment on lines +118 to +175
impl Event for RequestStatsEvent<'_> {
    #[inline]
    fn emit(&self) {
        let request_received_timestamp_s =
            format_optional_f64(self.stats.request_received_timestamp_s);
        let first_token_generated_timestamp_s =
            format_optional_f64(self.stats.first_token_generated_timestamp_s);
        let request_finished_timestamp_s =
            format_optional_f64(self.stats.request_finished_timestamp_s);
        let cache_hit_rate = format_optional_f64(self.stats.cache_hit_rate);
        let spec_decoding_acceptance_rate =
            format_optional_f64(self.stats.spec_decoding_acceptance_rate);
        let http_status_code = format_optional_u16(self.http_status_code);
        let error_message = self
            .error_message
            .or(self.stats.error_message.as_deref())
            .unwrap_or("None");

        if is_otel_enabled() {
            event!(
                Level::INFO,
                request_id = %self.request_id,
                model = %self.model,
                router_backend = %self.router_backend,
                http_status_code = %http_status_code,
                error_message = %error_message,
                engine = %self.stats.engine,
                request_received_timestamp_s = %request_received_timestamp_s,
                first_token_generated_timestamp_s = %first_token_generated_timestamp_s,
                request_finished_timestamp_s = %request_finished_timestamp_s,
                cache_hit_rate = %cache_hit_rate,
                spec_decoding_acceptance_rate = %spec_decoding_acceptance_rate,
                prompt_tokens = self.stats.prompt_tokens,
                completion_tokens = self.stats.completion_tokens,
                cached_tokens = self.stats.cached_tokens,
                "request_stats"
            );
        } else {
            debug!(
                request_id = %self.request_id,
                model = %self.model,
                router_backend = %self.router_backend,
                http_status_code = %http_status_code,
                error_message = %error_message,
                engine = %self.stats.engine,
                request_received_timestamp_s = %request_received_timestamp_s,
                first_token_generated_timestamp_s = %first_token_generated_timestamp_s,
                request_finished_timestamp_s = %request_finished_timestamp_s,
                cache_hit_rate = %cache_hit_rate,
                spec_decoding_acceptance_rate = %spec_decoding_acceptance_rate,
                prompt_tokens = self.stats.prompt_tokens,
                completion_tokens = self.stats.completion_tokens,
                cached_tokens = self.stats.cached_tokens,
                "request_stats"
            );
        }
    }
}


medium

The emit function contains a large block of duplicated code for handling the is_otel_enabled() cases. This can be simplified by determining the log level first and then using a single event! macro call. This will make the code more maintainable and less prone to errors if new fields are added.

impl Event for RequestStatsEvent<'_> {
    #[inline]
    fn emit(&self) {
        let request_received_timestamp_s =
            format_optional_f64(self.stats.request_received_timestamp_s);
        let first_token_generated_timestamp_s =
            format_optional_f64(self.stats.first_token_generated_timestamp_s);
        let request_finished_timestamp_s =
            format_optional_f64(self.stats.request_finished_timestamp_s);
        let cache_hit_rate = format_optional_f64(self.stats.cache_hit_rate);
        let spec_decoding_acceptance_rate =
            format_optional_f64(self.stats.spec_decoding_acceptance_rate);
        let http_status_code = format_optional_u16(self.http_status_code);
        let error_message = self
            .error_message
            .or(self.stats.error_message.as_deref())
            .unwrap_or("None");

        let level = if is_otel_enabled() { Level::INFO } else { Level::DEBUG };
        event!(
            level,
            request_id = %self.request_id,
            model = %self.model,
            router_backend = %self.router_backend,
            http_status_code = %http_status_code,
            error_message = %error_message,
            engine = %self.stats.engine,
            request_received_timestamp_s = %request_received_timestamp_s,
            first_token_generated_timestamp_s = %first_token_generated_timestamp_s,
            request_finished_timestamp_s = %request_finished_timestamp_s,
            cache_hit_rate = %cache_hit_rate,
            spec_decoding_acceptance_rate = %spec_decoding_acceptance_rate,
            prompt_tokens = self.stats.prompt_tokens,
            completion_tokens = self.stats.completion_tokens,
            cached_tokens = self.stats.cached_tokens,
            "request_stats"
        );
    }
}
References
  1. The emit function contains duplicated logic for handling is_otel_enabled() cases. This rule suggests extracting duplicated logic into a shared helper function to improve maintainability and reduce redundancy.

Comment on lines +57 to +69
let response_collection::CollectedResponses {
    completes: all_responses,
    request_stats,
} = match response_collection::collect_responses(
    execution_result,
    request_logprobs,
    self.enable_request_statistics,
)
.await
{
    Ok(collected) => collected,
    Err(err) => return Err(err),
};


medium

This match statement can be simplified by using the ? operator, as the error types are compatible. This will make the code more concise and readable.

Suggested change

let response_collection::CollectedResponses {
    completes: all_responses,
    request_stats,
} = match response_collection::collect_responses(
    execution_result,
    request_logprobs,
    self.enable_request_statistics,
)
.await
{
    Ok(collected) => collected,
    Err(err) => return Err(err),
};

let response_collection::CollectedResponses {
    completes: all_responses,
    request_stats,
} = response_collection::collect_responses(
    execution_result,
    request_logprobs,
    self.enable_request_statistics,
)
.await?;

Comment on lines +223 to +235
let response_collection::CollectedResponses {
    completes: all_responses,
    request_stats,
} = match response_collection::collect_responses(
    execution_result,
    request_logprobs,
    self.enable_request_statistics,
)
.await
{
    Ok(collected) => collected,
    Err(err) => return Err(err),
};


medium

This match statement can be simplified by using the ? operator, as the error types are compatible. This will make the code more concise and readable.

Suggested change

let response_collection::CollectedResponses {
    completes: all_responses,
    request_stats,
} = match response_collection::collect_responses(
    execution_result,
    request_logprobs,
    self.enable_request_statistics,
)
.await
{
    Ok(collected) => collected,
    Err(err) => return Err(err),
};

let response_collection::CollectedResponses {
    completes: all_responses,
    request_stats,
} = response_collection::collect_responses(
    execution_result,
    request_logprobs,
    self.enable_request_statistics,
)
.await?;

Comment on lines +212 to +224
let response_collection::CollectedResponses {
    completes: all_responses,
    request_stats,
} = match response_collection::collect_responses(
    execution_result,
    request_logprobs,
    self.enable_request_statistics,
)
.await
{
    Ok(collected) => collected,
    Err(err) => return Err(err),
};


medium

This match statement can be simplified by using the ? operator, as the error types are compatible. This will make the code more concise and readable.

Suggested change

let response_collection::CollectedResponses {
    completes: all_responses,
    request_stats,
} = match response_collection::collect_responses(
    execution_result,
    request_logprobs,
    self.enable_request_statistics,
)
.await
{
    Ok(collected) => collected,
    Err(err) => return Err(err),
};

let response_collection::CollectedResponses {
    completes: all_responses,
    request_stats,
} = response_collection::collect_responses(
    execution_result,
    request_logprobs,
    self.enable_request_statistics,
)
.await?;

Comment on lines +384 to +396
let response_collection::CollectedResponses {
    completes: all_responses,
    request_stats,
} = match response_collection::collect_responses(
    execution_result,
    request_logprobs,
    self.enable_request_statistics,
)
.await
{
    Ok(collected) => collected,
    Err(err) => return Err(err),
};


medium

This match statement can be simplified by using the ? operator, as the error types are compatible. This will make the code more concise and readable.

Suggested change

let response_collection::CollectedResponses {
    completes: all_responses,
    request_stats,
} = match response_collection::collect_responses(
    execution_result,
    request_logprobs,
    self.enable_request_statistics,
)
.await
{
    Ok(collected) => collected,
    Err(err) => return Err(err),
};

let response_collection::CollectedResponses {
    completes: all_responses,
    request_stats,
} = response_collection::collect_responses(
    execution_result,
    request_logprobs,
    self.enable_request_statistics,
)
.await?;

Comment on lines +616 to +673
pub(crate) async fn collect_stream_responses(
    stream: &mut ProtoStream,
    worker_name: &str,
    enable_request_statistics: bool,
) -> Result<CollectedStreamResponses, Response> {
    let mut all_responses = Vec::new();
    let mut stream_request_stats = Vec::new();

    while let Some(response) = stream.next().await {
        match response {
            Ok(gen_response) => {
                match gen_response.into_response() {
                    ProtoResponseVariant::Complete(complete) => {
                        all_responses.push(complete);
                    }
                    ProtoResponseVariant::Error(err) => {
                        error!(function = "collect_stream_responses", worker = %worker_name, error = %err.message(), "Worker generation error");
                        // Don't mark as completed - let Drop send abort for error cases
                        return Err(error::internal_error(
                            "worker_generation_failed",
                            format!("{} generation failed: {}", worker_name, err.message()),
                        ));
                    }
                    ProtoResponseVariant::Chunk(_chunk) => {
                        // Streaming chunk - no action needed
                    }
                    ProtoResponseVariant::RequestStats(request_stats) => {
                        if enable_request_statistics {
                            stream_request_stats.push(request_stats);
                        }
                    }
                    ProtoResponseVariant::None => {
                        // Empty response - no action needed
                    }
                }
            }
            Err(e) => {
                error!(function = "collect_stream_responses", worker = %worker_name, error = ?e, "Worker stream error");
                // Don't mark as completed - let Drop send abort for error cases
                return Err(error::internal_error(
                    "worker_stream_failed",
                    format!("{worker_name} stream failed: {e}"),
                ));
            }
        }
    }

    let request_stats = if enable_request_statistics {
        collect_request_stats(&all_responses, &stream_request_stats)
    } else {
        None
    };

    Ok(CollectedStreamResponses {
        completes: all_responses,
        request_stats,
    })
}


medium

This collect_stream_responses function appears to be a duplicate of the one defined in model_gateway/src/routers/grpc/common/response_collection.rs. The function in this file also seems to be unused. To improve maintainability and avoid confusion, please remove this duplicated function.

References
  1. The collect_stream_responses function is a duplicate of one defined elsewhere. This rule advises against unnecessary duplication to improve maintainability and avoid confusion.

@CatherineSue CatherineSue left a comment

Hey Scott, appreciate the effort on this — the idea of collecting request-level statistics from engines is genuinely useful and I can see the thought that went into supporting multiple backends. I couldn't help but take a look out of interest, and I noticed quite a few issues with the current design that I think are worth addressing before this goes further, even as a WIP.

I've left detailed line-by-line comments below, but here's the high-level summary:

Architectural concerns

  1. Stats collection is deeply entangled with abort/lifecycle — these are orthogonal concerns that shouldn't be coupled
  2. enable_request_statistics: bool is threaded through 15+ function signatures — this creates a massive API surface change for what should be an observability-only concern. The boolean-parameter-threading pattern is a well-known anti-pattern; consider a middleware/interceptor approach instead
  3. The proto changes couple stats into the streaming response union and the abort response — stats are metadata, not a response variant. This breaks the single-responsibility of those messages
  4. ~430 lines added to proto_wrapper.rs (already a large file) with three near-identical mapper implementations and four overlapping collection functions

Code quality concerns

  1. Multiple instances of match { Ok(x) => x, Err(e) => return Err(e) } which is literally the ? operator
  2. send_sse_or_abort takes 8 mutable reference parameters — a clear sign the function is doing too many things
  3. Box<dyn Error> used as error type throughout instead of proper error enums
  4. format_optional_f64 allocates String to represent None as the literal string "None" — use tracing's native Option support
  5. Copy-paste abort() methods across three gRPC client backends instead of a shared trait
  6. Two identical CollectedResponses/CollectedStreamResponses structs in different modules
  7. No tests for any of the new functionality
  8. Unrelated changes mixed in (MCP label constant, format string changes, finalized_analysis removal)

Suggested alternative approach

Rather than threading a boolean through every function, consider:

  • A stats-collecting stream wrapper that transparently intercepts ProtoStream and collects stats — zero changes to existing function signatures
  • Keep stats out of the proto response oneof — use the dedicated GetRequestStats RPC you already defined, or collect from the Complete messages you already receive
  • Keep abort as a pure lifecycle operation — don't overload it with stats collection

rpc GetLoads(GetLoadsRequest) returns (GetLoadsResponse);

// Get request-level statistics for a completed request
rpc GetRequestStats(GetRequestStatsRequest) returns (GetRequestStatsResponse);

This RPC is defined but never called anywhere in the gateway code. If the plan is to use it later, it shouldn't be in this PR. If it's the intended mechanism for stats collection, then the in-band RequestStats variant in the streaming response (line 201) is redundant — you'd be collecting stats two different ways.

Pick one approach: either poll via this RPC after completion, or receive stats in-band. Having both is confusing and creates maintenance burden.

GenerateStreamChunk chunk = 2;
GenerateComplete complete = 3;
GenerateError error = 4;
RequestStats request_stats = 5;

Adding RequestStats as a variant in the GenerateResponse oneof is a fundamental design issue.

The oneof currently has a clean semantic: chunk | complete | error. These are response lifecycle states. RequestStats is metadata about the response, not a response itself. Mixing these concerns in the same union means every consumer of the stream now needs to handle a variant that has nothing to do with the generation lifecycle.

The Complete message already carries prompt_tokens, completion_tokens, and cached_tokens. For the additional fields (timestamps, cache hit rate, etc.), consider either:

  • Extending GenerateComplete with optional stats fields, or
  • Using the GetRequestStats RPC you defined at line 34 (which is currently unused)

string request_id = 1;
}

message RequestStats {

The field numbering mixes optional and non-optional inconsistently. prompt_tokens (field 8), completion_tokens (field 9), and cached_tokens (field 10) are non-optional uint32, meaning they default to 0 on the wire. But semantically, "0 tokens" and "unknown" are different. If this message is meant to represent stats that may or may not be available from a given engine, all numeric fields should be optional.

Also, response_sent_timestamp_s (field 5) is defined here but never populated anywhere in the gateway code.

message AbortResponse {
    bool success = 1;
    string message = 2;
    RequestStats request_stats = 3;

This is the core of the problem with this PR's design: abort is a lifecycle operation, not a stats-collection mechanism.

An abort says "stop processing this request." The response should confirm whether the abort succeeded — which it already does with success and message. Piggybacking stats onto the abort response couples two orthogonal concerns:

  1. It forces abort callers to understand and handle stats even when they don't care
  2. It means stats are only available if you abort — what about requests that complete naturally?
  3. It changes the abort contract from "fire and forget" to "fire and parse stats"

This then cascades into the gateway code where abort_request() changes from Result<(), tonic::Status> to Result<proto::AbortResponse, Box<dyn Error>>, which is a breaking API change across all three backends.

pub async fn abort(
    &mut self,
    reason: String,
) -> Result<proto::AbortResponse, Box<dyn std::error::Error + Send + Sync>> {

Box<dyn std::error::Error + Send + Sync> as the error type is a step backwards from the existing tonic::Status. The original abort_request returned Result<(), tonic::Status> — callers knew exactly what errors to expect. Now they get an opaque box that could contain anything.

If you need a richer error type, define a proper error enum. Don't erase the type information.

buffer.extend_from_slice(b"\n\n");
}

async fn send_sse_or_abort(

async fn send_sse_or_abort(
    tx: &UnboundedSender<Result<Bytes, io::Error>>,
    payload: Bytes,
    client_disconnected: &mut bool,
    abort_sent: &mut bool,
    stream: &mut ProtoStream,
    send_error_message: &'static str,
    client_disconnect_error_message: &mut Option<&'static str>,
    stream_request_stats: Option<&mut Vec<ProtoRequestStats>>,
) -> Result<(), String>

8 parameters, 5 of which are mutable references. This function is doing at least three things:

  1. Sending an SSE payload
  2. Managing client disconnect state
  3. Aborting the backend stream and collecting stats

This should be decomposed. Consider a ClientConnection struct that encapsulates tx, client_disconnected, abort_sent, and client_disconnect_error_message. Then this becomes connection.send_or_abort(payload, stream, stats).

The function is called 13 times in this file alone, each time with the same 8-argument boilerplate. That's 13 call sites that will all break if the signature changes.

if !tool_calls.is_empty() {
    let analysis_content = if has_analysis {
        finalized_analysis
        // Get analysis from finalized parser output by calling finalize again

// Get analysis from finalized parser output by calling finalize again
// This is safe because finalize can be called multiple times
let output = parser.finalize(finish_reason.clone());
output.analysis

Calling finalize() twice because the original finalized_analysis variable was removed is a hack. "This is safe because finalize can be called multiple times" — is this actually guaranteed by the parser contract? If it is, it should be documented on the trait. If it isn't, this is a latent bug.

The original code stored finalized_analysis for exactly this reason. Why was it removed?

let label = session
    .map(|s| s.resolve_tool_server_label(tool_name))
    .unwrap_or_else(|| DEFAULT_SERVER_LABEL.to_string());
    .unwrap_or_else(|| "mcp".to_string());

.unwrap_or_else(|| "mcp".to_string()) — This replaces DEFAULT_SERVER_LABEL.to_string() with a hardcoded string literal. This appears in 3 places in this file (lines 777, 939, 1072).

This is an unrelated change that:

  1. Removes the use of a named constant, making future changes error-prone
  2. Hardcodes "mcp" without explanation of why the constant was wrong
  3. Shouldn't be in a PR about request statistics

Unrelated changes make PRs harder to review and should be in a separate commit at minimum.

/// # Returns
/// * `Ok(CollectedStreamResponses)` - Collected complete responses and unified request stats
/// * `Err(Response)` - Error response if the stream fails or returns an error
pub(crate) struct CollectedStreamResponses {

CollectedStreamResponses — this struct has the exact same fields as CollectedResponses in response_collection.rs (line 22): a Vec<ProtoGenerateComplete> and an Option<UnifiedRequestStats>.

Two identical structs in different modules for the same purpose is a clear DRY violation. Use one type.

merge_logprobs: bool,
) -> Result<Vec<ProtoGenerateComplete>, Response> {
let all_responses = match execution_result {
enable_request_statistics: bool,

enable_request_statistics: bool — this parameter is now threaded through:

  • collect_responsescollect_stream_responses (this file)
  • ResponseProcessor::new, HarmonyResponseProcessor::new, HarmonyStreamingProcessor::new
  • StreamingProcessor::new, GenerateStreamContext
  • RequestPipeline::new_regular, ::new_harmony_single, ::new_harmony_dual, ::new_pd
  • HarmonyResponseProcessingStage::new
  • GrpcRouter::build, GrpcPDRouter::build

That's 15+ function signatures changed for a single boolean. This is the "boolean parameter threading" anti-pattern. The feature should be implemented at the boundary (e.g., a wrapper around ProtoStream that transparently collects stats) rather than plumbed through every layer of the stack.
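To make the "implement at the boundary" alternative concrete, here is a toy sketch using `Iterator` in place of the real `futures::Stream` (the real `ProtoStream` is async; the adapter pattern is the same). Everything here is hypothetical illustration, not code from the PR:

```rust
// Hypothetical sketch: a wrapper that observes each item flowing through and
// collects stats as a side effect. Inner layers never see an
// enable_request_statistics flag; the router either wraps the stream or not.
struct StatsCollectingStream<I> {
    inner: I,
    stats: Vec<u32>, // stand-in for Vec<ProtoRequestStats>
}

impl<I: Iterator<Item = u32>> Iterator for StatsCollectingStream<I> {
    type Item = u32;
    fn next(&mut self) -> Option<u32> {
        let item = self.inner.next()?;
        self.stats.push(item); // record stats transparently
        Some(item)
    }
}
```

With this shape, the boolean lives at exactly one decision point instead of in 15+ signatures.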

@CatherineSue CatherineSue left a comment


Round 2 — more line-by-line issues I caught on closer inspection.

prefill_cached_tokens_by_index
.insert(complete_wrapper.index(), complete_wrapper.cached_tokens());
}
}

Indentation bug. The closing } here has been dedented one level:

            if let ProtoResponseVariant::Complete(complete_wrapper) = response.into_response() {
                prefill_cached_tokens_by_index
                    .insert(complete_wrapper.index(), complete_wrapper.cached_tokens());
        }  // <-- this should be at 12-space indent, not 8
        }

This makes it look like the if let body closes at the while level. The code still compiles because both braces are present, but the formatting is broken and misleading. Did cargo fmt actually run on this?

// Metadata from Complete message; seed cached_tokens from prefill phase (dual-stream)
let mut finish_reason: String;
let mut finalized_analysis: Option<String> = None;
let mut finish_reason = String::from("stop");

let mut finish_reason = String::from("stop"); — pre-initializing finish_reason to "stop" masks bugs. The original code left it uninitialized (declared later or via let finish_reason: String;), which means the compiler would catch any code path that uses it before the Complete message sets it.

By defaulting to "stop", if the Complete message is never received (e.g., stream ends early), you silently report a successful stop instead of surfacing the error. This defeats Rust's ability to catch missing-initialization bugs at compile time.

finalized_analysis = final_output.analysis;
accumulated_tool_calls = final_output.commentary;
// Store finalized tool calls and reasoning token count
accumulated_tool_calls.clone_from(&final_output.commentary);

accumulated_tool_calls.clone_from(&final_output.commentary); — why clone_from instead of = final_output.commentary? The clone_from method is typically used to reuse allocations when overwriting an existing value, but accumulated_tool_calls was None before this (it's initialized as Option<Vec<...>>). There's no allocation to reuse.

Also, the original code was accumulated_tool_calls = final_output.commentary; (a move). This PR changes it to a clone. Why? final_output isn't used after this point — the move was correct and zero-cost.

}
}
ProtoResponseVariant::Complete(complete) => {
completed_responses.push(complete.clone());

completed_responses.push(complete.clone()); — you're cloning the entire ProtoGenerateComplete proto message (which contains output text, logprobs, etc.) on the hot path, purely for stats collection.

This clone happens at every Complete event. For requests with n > 1, you clone multiple times. The stats you extract from it are just a few integers (prompt_tokens, completion_tokens, cached_tokens). Extract the stats eagerly into a lightweight struct instead of cloning the full proto.

Same issue in harmony/streaming.rs:303.
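The eager-extraction alternative only copies a few integers per Complete event. A hypothetical sketch (field names are illustrative, not the real proto):

```rust
// Hypothetical stand-in for the full Complete proto, which carries a large
// payload we do not want to clone just for stats.
struct FullComplete {
    output_text: String,
    prompt_tokens: u32,
    completion_tokens: u32,
    cached_tokens: u32,
}

// Lightweight stats sample: three u32s copied out of the Complete message.
#[derive(Debug, PartialEq)]
struct RequestStatsSample {
    prompt_tokens: u32,
    completion_tokens: u32,
    cached_tokens: u32,
}

impl RequestStatsSample {
    fn from_complete(c: &FullComplete) -> Self {
        // Copies three integers; the output text is never touched.
        Self {
            prompt_tokens: c.prompt_tokens,
            completion_tokens: c.completion_tokens,
            cached_tokens: c.cached_tokens,
        }
    }
}
```

`completed_responses` can then hold `RequestStatsSample` values and the hot path never clones the full message.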

});
}

async fn emit_generate_request_stats(

async fn emit_generate_request_stats(...) — this function is marked async but its body is entirely synchronous. It calls collect_request_stats() (sync) and emit() (sync). No .await appears in the function body.

An unnecessary async forces all callers to .await it and prevents calling from synchronous contexts. Remove the async.

let mut prefill_cached_tokens_by_index: HashMap<u32, u32> = HashMap::new();
while let Some(result) = prefill_stream.next().await {
let response = result.map_err(|e| format!("Prefill stream error: {}", e.message()))?;
let response = result.map_err(|e| format!("Prefill stream error: {e}"))?;

format!("Prefill stream error: {e}") — this was previously format!("Prefill stream error: {}", e.message()). This is not just a style change.

e here is a tonic::Status. Display for tonic::Status includes the status code (e.g., "status: Internal, message: ..."), while .message() returns just the message string. This changes the error string format, which could break log parsers or error matching.

This is an unrelated semantic change buried in a stats PR.

}
}
ProtoResponseVariant::Complete(complete_wrapper) => {
completed_responses.push(complete_wrapper.clone());

completed_responses.push(complete_wrapper.clone()); — same unnecessary full-proto clone as in regular/streaming.rs:501. This is on the streaming hot path inside a while let Some(response) loop.

let mut error_message: Option<String> = None;

for sample in stats {
seen += 1;

let mut seen = 0u64;
// ...
for sample in stats {
    seen += 1;
    // ...
}
if seen == 0 { return None; }

The seen counter is unnecessary. You can just check stats.is_empty() before the loop, or use the fact that stats.first()? already returns None for empty slices (which you do at the top of the calling function anyway). This manual counter adds complexity for no reason.
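The suggested refactor in sketch form (the aggregation itself is a placeholder; only the empty-check shape matters):

```rust
// Hypothetical sketch: an early is_empty() check replaces the manual `seen`
// counter — the loop body then only does the aggregation work.
fn aggregate(stats: &[u32]) -> Option<u64> {
    if stats.is_empty() {
        return None; // nothing to aggregate; no counter needed
    }
    Some(stats.iter().map(|&s| u64::from(s)).sum::<u64>())
}
```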

/// HarmonyParserAdapter to extract the complete response.
pub(crate) struct HarmonyResponseProcessor;
pub(crate) struct HarmonyResponseProcessor {
enable_request_statistics: bool,

HarmonyResponseProcessor was a unit struct (struct HarmonyResponseProcessor;). Now it carries enable_request_statistics: bool.

This means a struct whose job is response processing now carries configuration about whether to collect metrics. This violates single responsibility — the processor should process responses, and something else should decide whether stats get emitted.

This is why the middleware/wrapper approach would be cleaner: the processor stays focused on its job, and the stats layer wraps around it.

processor: HarmonyResponseProcessor::new(enable_request_statistics),
streaming_processor: Arc::new(HarmonyStreamingProcessor::new(enable_request_statistics)),
}
}

streaming_processor: Arc::new(HarmonyStreamingProcessor::new(enable_request_statistics)),

This line is over 80 chars (rustfmt would catch this) and shows the config boolean being passed into an Arc-wrapped processor. The enable_request_statistics value is now baked into an Arc that lives for the lifetime of the stage. If you ever wanted to toggle stats at runtime (e.g., via a config reload), this architecture makes it impossible.

scottjlee and others added 8 commits March 11, 2026 22:45
@scottjlee
Collaborator Author

CI fails because the branch is on my fork, so the API keys used in the CI tests are unavailable. Closing in favor of #757, which should fix this issue.

@scottjlee scottjlee closed this Mar 13, 2026
