feat(completions): add native gRPC stages for /v1/completions #772
Closed
vschandramourya wants to merge 2 commits into feat/completions-response-infra from feat/completions-native-stages
15 changes: 15 additions & 0 deletions
model_gateway/src/routers/grpc/regular/stages/completion/mod.rs
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,15 @@ | ||
| //! Completion endpoint pipeline stages | ||
| //! | ||
| //! These stages handle completion-specific preprocessing, request building, and | ||
| //! response processing. `CompletionRequest` flows natively as | ||
| //! `RequestType::Completion` through every pipeline stage — preparation, worker | ||
| //! selection, client acquisition, request building, execution, and response | ||
| //! processing — following the same architecture as chat and generate. | ||
|
|
||
| mod preparation; | ||
| mod request_building; | ||
| mod response_processing; | ||
|
|
||
| pub(crate) use preparation::CompletionPreparationStage; | ||
| pub(crate) use request_building::CompletionRequestBuildingStage; | ||
| pub(crate) use response_processing::CompletionResponseProcessingStage; |
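The module comment describes a request flowing through a sequence of stages, and every stage in this PR returns `Result<Option<Response>, Response>`. A minimal sketch of how such a contract could be driven is shown below. It is an illustration only: the trait here is synchronous, `Response` is a plain `String`, and `run_pipeline`, `Continue`, and `Finish` are hypothetical names that do not appear in the diff. The assumed semantics are that `Ok(None)` means "continue to the next stage", `Ok(Some(_))` means "a response is ready, stop here" (as the streaming branch does), and `Err(_)` aborts with an error response.

```rust
// Hypothetical, simplified stand-ins. The real trait is async and uses
// axum's `Response`; a String is used here to keep the sketch self-contained.
type Response = String;

#[derive(Default)]
struct RequestContext {
    // Records which stages ran, purely for illustration.
    trace: Vec<&'static str>,
}

trait PipelineStage {
    fn execute(&self, ctx: &mut RequestContext) -> Result<Option<Response>, Response>;
    fn name(&self) -> &'static str;
}

// A stage that does its work and lets the pipeline continue.
struct Continue(&'static str);

// A stage that produces the final response and ends the pipeline early.
struct Finish(&'static str);

impl PipelineStage for Continue {
    fn execute(&self, _ctx: &mut RequestContext) -> Result<Option<Response>, Response> {
        Ok(None)
    }
    fn name(&self) -> &'static str {
        self.0
    }
}

impl PipelineStage for Finish {
    fn execute(&self, _ctx: &mut RequestContext) -> Result<Option<Response>, Response> {
        Ok(Some(format!("done by {}", self.0)))
    }
    fn name(&self) -> &'static str {
        self.0
    }
}

// Assumed driver loop: run stages in order, stop at the first response or error.
fn run_pipeline(
    stages: &[Box<dyn PipelineStage>],
    ctx: &mut RequestContext,
) -> Result<Option<Response>, Response> {
    for stage in stages {
        ctx.trace.push(stage.name());
        if let Some(response) = stage.execute(ctx)? {
            return Ok(Some(response));
        }
    }
    Ok(None)
}
```

Under these assumptions, a stage placed after one that returns `Ok(Some(_))` never runs, which matches how the streaming path in the response processing stage returns its response directly.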
78 changes: 78 additions & 0 deletions
model_gateway/src/routers/grpc/regular/stages/completion/preparation.rs
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,78 @@ | ||
| //! Completion preparation stage: Resolve prompt, tokenize, create stop decoder | ||
| //! | ||
| //! Structurally mirrors `GeneratePreparationStage` but reads from | ||
| //! `CompletionRequest` fields directly (prompt, stop, skip_special_tokens). | ||
| //! No chat template, no tool calls, no multimodal. | ||
|
|
||
| use async_trait::async_trait; | ||
| use axum::response::Response; | ||
| use openai_protocol::common::StringOrArray; | ||
| use tracing::error; | ||
|
|
||
| use crate::routers::{ | ||
| error, | ||
| grpc::{ | ||
| common::stages::PipelineStage, | ||
| context::{PreparationOutput, RequestContext}, | ||
| utils, | ||
| }, | ||
| }; | ||
|
|
||
| pub(crate) struct CompletionPreparationStage; | ||
|
|
||
| #[async_trait] | ||
| impl PipelineStage for CompletionPreparationStage { | ||
| async fn execute(&self, ctx: &mut RequestContext) -> Result<Option<Response>, Response> { | ||
| let request = ctx.completion_request_arc(); | ||
|
|
||
| let tokenizer = | ||
| utils::resolve_tokenizer(ctx, "CompletionPreparationStage::execute").map_err(|e| *e)?; | ||
|
|
||
| let prompt_text = match &request.prompt { | ||
| StringOrArray::String(s) => s.clone(), | ||
| StringOrArray::Array(_) => { | ||
| return Err(error::bad_request( | ||
| "batch_prompts_not_supported", | ||
| "Batched prompt arrays are not supported for gRPC /v1/completions yet", | ||
| )); | ||
| } | ||
| }; | ||
|
|
||
| let encoding = tokenizer.encode(&prompt_text, false).map_err(|e| { | ||
| error!( | ||
| function = "CompletionPreparationStage::execute", | ||
| error = %e, | ||
| "Tokenization failed" | ||
| ); | ||
| error::bad_request("tokenization_failed", format!("Tokenization failed: {e}")) | ||
| })?; | ||
|
|
||
| let stop_decoder = utils::create_stop_decoder( | ||
| &tokenizer, | ||
| request.stop.as_ref(), | ||
| request.stop_token_ids.as_ref(), | ||
| request.skip_special_tokens, | ||
| request.no_stop_trim, | ||
| ); | ||
|
|
||
| ctx.state.preparation = Some(PreparationOutput { | ||
| original_text: Some(prompt_text), | ||
| token_ids: encoding.token_ids().to_vec(), | ||
| processed_messages: None, | ||
| tool_constraints: None, | ||
| filtered_request: None, | ||
| harmony_mode: false, | ||
| selection_text: None, | ||
| harmony_messages: None, | ||
| harmony_stop_ids: None, | ||
| }); | ||
|
|
||
| ctx.state.response.stop_decoder = Some(stop_decoder); | ||
|
|
||
| Ok(None) | ||
| } | ||
|
|
||
| fn name(&self) -> &'static str { | ||
| "CompletionPreparation" | ||
| } | ||
| } |
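The prompt-resolution step above accepts a single string and rejects a batched array. A standalone sketch of that decision follows, assuming a local enum that only imitates the shape of `openai_protocol::common::StringOrArray`; the helper name `resolve_single_prompt` and the tuple error type are hypothetical and stand in for the gateway's `error::bad_request` response.

```rust
// Hypothetical stand-in with the same two variants the diff matches on.
enum StringOrArray {
    String(String),
    Array(Vec<String>),
}

// Returns the prompt text, or an (error_code, message) pair for batched input.
// The code and message are copied from the diff; the tuple type is an assumption.
fn resolve_single_prompt(
    prompt: &StringOrArray,
) -> Result<String, (&'static str, &'static str)> {
    match prompt {
        StringOrArray::String(s) => Ok(s.clone()),
        StringOrArray::Array(_) => Err((
            "batch_prompts_not_supported",
            "Batched prompt arrays are not supported for gRPC /v1/completions yet",
        )),
    }
}
```

Note that this rejects every array, including one with a single element, which is what the `match` in the diff does as well.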
95 changes: 95 additions & 0 deletions
model_gateway/src/routers/grpc/regular/stages/completion/request_building.rs
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,95 @@ | ||
| //! Completion request building stage: Build proto GenerateRequest from CompletionRequest | ||
| //! | ||
| //! Follows the same pattern as `ChatRequestBuildingStage` and | ||
| //! `GenerateRequestBuildingStage`: reads preparation output, gets the gRPC | ||
| //! client, and calls `builder_client.build_completion_request()` to convert | ||
| //! `CompletionRequest` directly to the backend proto format. | ||
|
|
||
| use async_trait::async_trait; | ||
| use axum::response::Response; | ||
| use tracing::error; | ||
| use uuid::Uuid; | ||
|
|
||
| use crate::routers::{ | ||
| error, | ||
| grpc::{ | ||
| common::stages::{helpers, PipelineStage}, | ||
| context::{ClientSelection, RequestContext}, | ||
| proto_wrapper::ProtoRequest, | ||
| }, | ||
| }; | ||
|
|
||
| pub(crate) struct CompletionRequestBuildingStage { | ||
| inject_pd_metadata: bool, | ||
| } | ||
|
|
||
| impl CompletionRequestBuildingStage { | ||
| pub fn new(inject_pd_metadata: bool) -> Self { | ||
| Self { inject_pd_metadata } | ||
| } | ||
| } | ||
|
|
||
| #[async_trait] | ||
| impl PipelineStage for CompletionRequestBuildingStage { | ||
| async fn execute(&self, ctx: &mut RequestContext) -> Result<Option<Response>, Response> { | ||
| let prep = ctx.state.preparation.as_ref().ok_or_else(|| { | ||
| error!( | ||
| function = "CompletionRequestBuildingStage::execute", | ||
| "Preparation not completed" | ||
| ); | ||
| error::internal_error("preparation_not_completed", "Preparation not completed") | ||
| })?; | ||
|
|
||
| let clients = ctx.state.clients.as_ref().ok_or_else(|| { | ||
| error!( | ||
| function = "CompletionRequestBuildingStage::execute", | ||
| "Client acquisition not completed" | ||
| ); | ||
| error::internal_error( | ||
| "client_acquisition_not_completed", | ||
| "Client acquisition not completed", | ||
| ) | ||
| })?; | ||
|
|
||
| let completion_request = ctx.completion_request_arc(); | ||
|
|
||
| let builder_client = match clients { | ||
| ClientSelection::Single { client } => client, | ||
| ClientSelection::Dual { prefill, .. } => prefill, | ||
| }; | ||
|
|
||
| let request_id = format!("cmpl-{}", Uuid::now_v7()); | ||
|
|
||
| let mut proto_request = builder_client | ||
| .build_completion_request( | ||
| request_id, | ||
| &completion_request, | ||
| prep.original_text.clone().unwrap_or_default(), | ||
| prep.token_ids.clone(), | ||
| ) | ||
| .map_err(|e| { | ||
| error!( | ||
| function = "CompletionRequestBuildingStage::execute", | ||
| error = %e, | ||
| "Failed to build completion request" | ||
| ); | ||
| error::bad_request( | ||
| "invalid_request_parameters", | ||
| format!("Invalid request parameters: {e}"), | ||
| ) | ||
| })?; | ||
|
|
||
| if self.inject_pd_metadata { | ||
| if let Some(workers) = ctx.state.workers.as_ref() { | ||
| helpers::maybe_inject_pd_metadata(&mut proto_request, workers); | ||
| } | ||
| } | ||
|
|
||
| ctx.state.proto_request = Some(ProtoRequest::Generate(proto_request)); | ||
| Ok(None) | ||
| } | ||
|
|
||
| fn name(&self) -> &'static str { | ||
| "CompletionRequestBuilding" | ||
| } | ||
| } | ||
147 changes: 147 additions & 0 deletions
model_gateway/src/routers/grpc/regular/stages/completion/response_processing.rs
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,147 @@ | ||
| //! Completion response processing stage | ||
| //! | ||
| //! Non-streaming: collects generate results via the shared processor and wraps | ||
| //! them as `CompletionResponse`. Streaming: delegates to the completion-aware | ||
| //! streaming processor, which emits OpenAI `CompletionStreamResponse` chunks | ||
| //! directly from typed proto responses. | ||
|
|
||
| use std::sync::Arc; | ||
|
|
||
| use async_trait::async_trait; | ||
| use axum::response::Response; | ||
| use tracing::error; | ||
|
|
||
| use crate::{ | ||
| core::AttachedBody, | ||
| routers::{ | ||
| error, | ||
| grpc::{ | ||
| common::stages::PipelineStage, | ||
| context::{FinalResponse, RequestContext}, | ||
| regular::{processor, streaming}, | ||
| }, | ||
| }, | ||
| }; | ||
|
|
||
| pub(crate) struct CompletionResponseProcessingStage { | ||
| processor: processor::ResponseProcessor, | ||
| streaming_processor: Arc<streaming::StreamingProcessor>, | ||
| } | ||
|
|
||
| impl CompletionResponseProcessingStage { | ||
| pub fn new( | ||
| processor: processor::ResponseProcessor, | ||
| streaming_processor: Arc<streaming::StreamingProcessor>, | ||
| ) -> Self { | ||
| Self { | ||
| processor, | ||
| streaming_processor, | ||
| } | ||
| } | ||
| } | ||
|
|
||
| #[async_trait] | ||
| impl PipelineStage for CompletionResponseProcessingStage { | ||
| async fn execute(&self, ctx: &mut RequestContext) -> Result<Option<Response>, Response> { | ||
| self.process_completion_response(ctx).await | ||
| } | ||
|
|
||
| fn name(&self) -> &'static str { | ||
| "CompletionResponseProcessing" | ||
| } | ||
| } | ||
|
|
||
| impl CompletionResponseProcessingStage { | ||
| async fn process_completion_response( | ||
| &self, | ||
| ctx: &mut RequestContext, | ||
| ) -> Result<Option<Response>, Response> { | ||
| let is_streaming = ctx.is_streaming(); | ||
| let completion_req = ctx.completion_request_arc(); | ||
|
|
||
| let execution_result = ctx.state.response.execution_result.take().ok_or_else(|| { | ||
| error!( | ||
| function = "CompletionResponseProcessingStage::execute", | ||
| "No execution result" | ||
| ); | ||
| error::internal_error("no_execution_result", "No execution result") | ||
| })?; | ||
|
|
||
| let dispatch = ctx | ||
| .state | ||
| .dispatch | ||
| .as_ref() | ||
| .ok_or_else(|| { | ||
| error!( | ||
| function = "CompletionResponseProcessingStage::execute", | ||
| "Dispatch metadata not set" | ||
| ); | ||
| error::internal_error("dispatch_metadata_not_set", "Dispatch metadata not set") | ||
| })? | ||
| .clone(); | ||
|
|
||
| let tokenizer = ctx.tokenizer_arc().ok_or_else(|| { | ||
| error!( | ||
| function = "CompletionResponseProcessingStage::execute", | ||
| "Tokenizer not cached in context" | ||
| ); | ||
| error::internal_error( | ||
| "tokenizer_not_cached", | ||
| "Tokenizer not cached in context - preparation stage may have been skipped", | ||
| ) | ||
| })?; | ||
|
|
||
| let prompt_text = ctx | ||
| .state | ||
| .preparation | ||
| .as_ref() | ||
| .and_then(|p| p.original_text.clone()) | ||
| .unwrap_or_default(); | ||
|
|
||
| if is_streaming { | ||
| let response = self | ||
| .streaming_processor | ||
| .clone() | ||
| .process_streaming_completion( | ||
| execution_result, | ||
| completion_req.clone(), | ||
| dispatch, | ||
| tokenizer, | ||
| prompt_text, | ||
| ); | ||
|
|
||
| let response = match ctx.state.load_guards.take() { | ||
| Some(guards) => AttachedBody::wrap_response(response, guards), | ||
| None => response, | ||
| }; | ||
| return Ok(Some(response)); | ||
| } | ||
|
|
||
| // Non-streaming | ||
| let stop_decoder = ctx.state.response.stop_decoder.as_mut().ok_or_else(|| { | ||
| error!( | ||
| function = "CompletionResponseProcessingStage::execute", | ||
| "Stop decoder not initialized" | ||
| ); | ||
| error::internal_error( | ||
| "stop_decoder_not_initialized", | ||
| "Stop decoder not initialized", | ||
| ) | ||
| })?; | ||
|
|
||
| let completion_response = self | ||
| .processor | ||
| .process_non_streaming_completion_response( | ||
| execution_result, | ||
| completion_req, | ||
| dispatch, | ||
| tokenizer, | ||
| stop_decoder, | ||
| &prompt_text, | ||
| ) | ||
| .await?; | ||
|
|
||
| ctx.state.response.final_response = Some(FinalResponse::Completion(completion_response)); | ||
| Ok(None) | ||
| } | ||
| } |
The `CompletionPreparationStage` always sets `original_text`, so `prep.original_text` should not be `None` at this point. Using `unwrap_or_default()` can hide a potential logic error if it were ever `None`, which could lead to silent failures where an empty string is used as the prompt. It would be more robust to explicitly handle the `None` case as an internal error. This makes the contract between stages explicit and ensures the system fails fast if an invariant is broken.
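One way the suggestion could look is sketched below. This is not code from the PR: `PreparationOutput` is reduced to the single field under discussion, `StageError` stands in for the error `Response` the real stages return, and the helper `require_original_text` and the error code `original_text_not_set` are hypothetical names chosen for illustration.

```rust
// Hypothetical, reduced stand-in; only `original_text` mirrors the diff.
#[derive(Debug, Default)]
struct PreparationOutput {
    original_text: Option<String>,
}

// Stand-in for the error response produced by `error::internal_error`.
#[derive(Debug, PartialEq)]
struct StageError {
    code: &'static str,
    message: String,
}

fn internal_error(code: &'static str, message: &str) -> StageError {
    StageError {
        code,
        message: message.to_string(),
    }
}

/// Returns the prompt text set by the preparation stage, or an internal
/// error if the invariant "preparation always sets original_text" is broken.
fn require_original_text(prep: &PreparationOutput) -> Result<&str, StageError> {
    prep.original_text.as_deref().ok_or_else(|| {
        internal_error(
            "original_text_not_set",
            "Preparation output is missing original_text",
        )
    })
}
```

In the request building stage, a call like this would replace `prep.original_text.clone().unwrap_or_default()`, so a missing prompt surfaces as an internal error instead of an empty string being sent to the backend. The same pattern would apply to the `unwrap_or_default()` on `prompt_text` in the response processing stage.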