
feat(render): implement VllmRender gRPC service for GPU-less rendering#784

Open
hyeongyun0916 wants to merge 20 commits into lightseekorg:main from hyeongyun0916:feat/vllm-render

Conversation


@hyeongyun0916 hyeongyun0916 commented Mar 17, 2026

Description

Context

This PR adds render gRPC support (VllmRender service) to smg-grpc-proto and smg-grpc-servicer, required by vllm-project/vllm#36102.

Per review feedback, the render servicer should live in this package rather than in the vllm repo, following the same pattern as VllmEngineServicer (#36169).

Problem

vLLM's disaggregated serving architecture requires a GPU-less render node that applies chat templates and tokenizes requests without running inference. Currently there is no gRPC interface for this render-only functionality, limiting communication between prefill/decode nodes and the render node to HTTP only.

Solution

Implement a new VllmRender gRPC service with management RPCs (HealthCheck, GetModelInfo, GetServerInfo) and rendering RPCs (RenderChat, RenderCompletion). The service converts protobuf messages to vLLM's Pydantic request models, delegates to openai_serving_render, and serializes responses back to proto.
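As a rough sketch of that flow: the dataclass stand-ins below are illustrative assumptions only, since the real service works with protobuf messages, vLLM's Pydantic request models, and the package's from_proto / pydantic_to_proto helpers.

```python
from dataclasses import dataclass, asdict

@dataclass
class ChatCompletionRequest:
    # stand-in for vLLM's Pydantic request model (assumption for illustration)
    model: str
    messages: list

def from_proto(proto_dict):
    # real code converts the proto via MessageToDict and applies field
    # transforms before validating into the Pydantic model
    return ChatCompletionRequest(**proto_dict)

def pydantic_to_proto(request):
    # real code serializes the rendered GenerateRequest back into
    # GenerateRequestProto; a plain dict stands in here
    return asdict(request)

request = from_proto({"model": "m", "messages": [{"role": "user", "content": "hi"}]})
rendered = pydantic_to_proto(request)
```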

Changes

  • Add vllm_render.proto defining the VllmRender service, chat/completion rendering messages, and GenerateRequestProto
  • Add RenderGrpcServicer implementing all VllmRender RPCs with proper gRPC status code error handling
  • Add proto_utils.py with generic protobuf ↔ Pydantic/dict conversion utilities (proto_to_dict, from_proto, pydantic_to_proto)
  • Add field_transforms.py with transform rules bridging proto field naming limitations to vLLM's OpenAI-compatible Python models
  • Export vllm_render_pb2 / vllm_render_pb2_grpc from smg-grpc-proto package
  • Bump smg-grpc-proto to 0.5.0 and smg-grpc-servicer to 0.6.0
  • Add 51 unit tests covering field_transforms, proto_utils, and render_servicer
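The conversion utilities listed above can be sketched as follows. apply_transforms is a simplified stand-in for the package's _apply_transforms, and the table entry mirrors the parameters_json rule named in this PR; the actual implementation details are assumptions.

```python
import json

def apply_transforms(data, transforms):
    # recursively walk MessageToDict output, renaming keys and converting
    # values according to a FIELD_TRANSFORMS-style table
    if isinstance(data, list):
        return [apply_transforms(item, transforms) for item in data]
    if not isinstance(data, dict):
        return data
    result = {}
    for key, value in data.items():
        value = apply_transforms(value, transforms)
        if key in transforms:
            new_key, fn = transforms[key]
            result[new_key] = fn(value) if fn else value
        else:
            result[key] = value
    return result

transforms = {"parameters_json": ("parameters", json.loads)}
out = apply_transforms({"tools": [{"parameters_json": '{"type": "object"}'}]}, transforms)
```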

Test Plan

  • pytest grpc_servicer/tests/ -v — 51 passed
  • Verify proto compiles: pip install -e crates/grpc_client/python/ builds vllm_render_pb2 stubs successfully
Checklist
  • cargo +nightly fmt passes
  • cargo clippy --all-targets --all-features -- -D warnings passes
  • (Optional) Documentation updated
  • (Optional) Please join us on Slack #sig-smg to discuss, review, and merge PRs

Summary by CodeRabbit

  • New Features
    • Added a GPU-less vLLM Render gRPC service for chat and completion rendering with OpenAI-compatible multimodal schema and tool-invocation support; health/model/server info endpoints included.
  • Chores
    • Package version bumps, public exports updated to surface render protos/servicer, and build updated to include the new render proto.
  • Tests
    • Added unit tests for the render servicer, field transforms, and protobuf conversion utilities.


coderabbitai bot commented Mar 17, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds a GPU-less vLLM render gRPC surface: new vllm_render proto, client build/export updates, RenderGrpcServicer implementation, proto↔Pydantic conversion utilities and field transforms, packaging/dependency bumps, and comprehensive unit tests for the new functionality.

Changes

Cohort / File(s) Summary
Proto Definitions
crates/grpc_client/proto/vllm_render.proto
New vllm.grpc.render proto: VllmRender service (HealthCheck, GetModelInfo, GetServerInfo, RenderChat, RenderCompletion) and messages for multimodal content, tools, chat messages, prompts, render requests/responses, and GenerateRequestProto.
Client build & Python exports
crates/grpc_client/build.rs, crates/grpc_client/python/pyproject.toml, crates/grpc_client/python/smg_grpc_proto/__init__.py
Adds proto to build triggers/compilation, bumps Python client version to 0.5.0, and re-exports generated vllm_render_pb2/vllm_render_pb2_grpc.
Servicer surface
grpc_servicer/smg_grpc_servicer/vllm/render_servicer.py, grpc_servicer/smg_grpc_servicer/vllm/__init__.py
New RenderGrpcServicer with management RPCs and RenderChat/RenderCompletion handlers; exported via vllm package __all__.
Proto ↔ Pydantic helpers
grpc_servicer/smg_grpc_servicer/vllm/proto_utils.py, grpc_servicer/smg_grpc_servicer/vllm/field_transforms.py
Adds conversion utilities (proto_to_dict, from_proto, pydantic_to_proto, _apply_transforms) and field transforms (flatten_completion_prompt, _parse_tool_choice, _ensure_message_content, FIELD_TRANSFORMS).
Packaging & deps
grpc_servicer/pyproject.toml
Bumps servicer version to 0.6.0, updates dependency smg-grpc-proto>=0.5.0, adds dev extras (pytest, pytest-asyncio, vllm) and pytest config.
Tests & fixtures
grpc_servicer/tests/conftest.py, grpc_servicer/tests/test_field_transforms.py, grpc_servicer/tests/test_proto_utils.py, grpc_servicer/tests/test_render_servicer.py
Adds fixtures and extensive unit tests covering field transforms, proto utilities, RenderGrpcServicer behavior (success, errors, and serialization paths).
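The _ensure_message_content transform from the cohort above can be illustrated roughly like this; the empty-string default is an assumption, not necessarily what the PR uses.

```python
def _ensure_message_content(messages):
    # MessageToDict omits unset fields, so an assistant message carrying only
    # tool_calls may arrive without a content key at all
    for message in messages:
        message.setdefault("content", "")
    return messages

msgs = _ensure_message_content([
    {"role": "assistant", "tool_calls": [{"id": "call_1"}]},
    {"role": "user", "content": "hi"},
])
```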

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client
    participant RenderServicer as RenderGrpcServicer
    participant ProtoUtils as ProtoUtils
    participant Pydantic as PydanticModel
    participant Renderer as vLLM_Render

    Client->>RenderServicer: RenderChat(RenderChatRequest proto)
    RenderServicer->>ProtoUtils: from_proto(proto, transforms)
    ProtoUtils->>ProtoUtils: MessageToDict + _apply_transforms
    ProtoUtils->>Pydantic: construct request model
    Pydantic-->>RenderServicer: request instance
    RenderServicer->>Renderer: render_chat_request(request)
    Renderer-->>RenderServicer: GenerateRequest (Pydantic)
    RenderServicer->>ProtoUtils: pydantic_to_proto(GenerateRequest)
    ProtoUtils-->>RenderServicer: GenerateRequestProto
    RenderServicer-->>Client: RenderChatResponse(GenerateRequestProto)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes


Suggested labels

grpc, multimodal

Suggested reviewers

  • key4ng
  • slin1237
  • njhill

Poem

🐰 I hopped through proto lines tonight,
drew chats, images, tokens bright,
parsed prompts into tidy nests,
stitched Pydantic to proto vests,
tests twitched noses — render's right.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 16.87%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed: check skipped; CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed: the title clearly and concisely describes the main feature, implementing a VllmRender gRPC service for GPU-less rendering, which aligns with the PR's primary objective.


@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a crucial VllmRender gRPC service to support vLLM's disaggregated serving architecture. This new service enables GPU-less render nodes to efficiently apply chat templates and tokenize requests via a gRPC interface, addressing the current limitation of HTTP-only communication for render-only functionality. The changes facilitate seamless integration and improved performance for rendering operations within the vLLM ecosystem.

Highlights

  • New gRPC Service: Introduced VllmRender gRPC service for GPU-less rendering, including management and rendering RPCs to support vLLM's disaggregated serving architecture.
  • Protocol Definition: Defined vllm_render.proto to specify the new service, comprehensive chat/completion rendering messages, and GenerateRequestProto.
  • Servicer Implementation: Implemented RenderGrpcServicer to handle VllmRender RPCs with robust gRPC status code error handling.
  • Protobuf Utilities: Added proto_utils.py for generic protobuf to Pydantic/dict conversion and field_transforms.py for bridging proto field naming limitations to vLLM's OpenAI-compatible Python models.
  • Package Updates: Exported new protobuf stubs (vllm_render_pb2, vllm_render_pb2_grpc) and bumped smg-grpc-proto to 0.5.0 and smg-grpc-servicer to 0.6.0.
  • Comprehensive Testing: Included 51 new unit tests covering field transformations, protobuf utilities, and the render servicer to ensure correctness.
Changelog
  • crates/grpc_client/proto/vllm_render.proto
    • Added a new protobuf definition for the VllmRender service, including HealthCheck, GetModelInfo, GetServerInfo, RenderChat, and RenderCompletion RPCs.
    • Defined messages for chat rendering (e.g., ImageUrlContent, ContentPart, ChatCompletionMessage, RenderChatRequest, RenderChatResponse) and completion rendering (e.g., CompletionPrompt, RenderCompletionRequest, RenderCompletionResponse).
    • Included GenerateRequestProto for mirroring vLLM's internal request protocol.
  • crates/grpc_client/python/pyproject.toml
    • Updated the smg-grpc-proto package version from 0.4.2 to 0.5.0.
  • crates/grpc_client/python/smg_grpc_proto/__init__.py
    • Exported the newly generated vllm_render_pb2 and vllm_render_pb2_grpc modules.
  • grpc_servicer/pyproject.toml
    • Updated the smg-grpc-servicer package version from 0.5.0 to 0.6.0.
    • Updated the dependency on smg-grpc-proto to >=0.5.0.
    • Added dev optional dependencies for pytest and pytest-asyncio.
    • Configured pytest.ini_options to specify test paths and asyncio mode.
  • grpc_servicer/smg_grpc_servicer/vllm/__init__.py
    • Imported and exported RenderGrpcServicer to make it part of the public API.
  • grpc_servicer/smg_grpc_servicer/vllm/field_transforms.py
    • Added a new module defining FIELD_TRANSFORMS for converting protobuf field names/values to match vLLM's Pydantic models.
    • Implemented flatten_completion_prompt to handle CompletionPrompt oneof fields.
    • Implemented _ensure_message_content to ensure content keys are present in chat messages.
  • grpc_servicer/smg_grpc_servicer/vllm/proto_utils.py
    • Added a new module with generic utilities for converting between protobuf messages and Python types.
    • Provided proto_to_dict, from_proto, and pydantic_to_proto functions.
    • Included _apply_transforms for recursive application of field transformations.
  • grpc_servicer/smg_grpc_servicer/vllm/render_servicer.py
    • Added a new module implementing the RenderGrpcServicer class.
    • Implemented HealthCheck, GetModelInfo, GetServerInfo, RenderChat, and RenderCompletion RPCs.
    • Integrated proto_utils and field_transforms for request/response handling.
  • grpc_servicer/tests/conftest.py
    • Added a new module providing shared pytest fixtures, including mock_grpc_context, mock_model_config, and mock_state.
  • grpc_servicer/tests/test_field_transforms.py
    • Added a new module with unit tests for field_transforms.py, covering flatten_completion_prompt and _ensure_message_content.
  • grpc_servicer/tests/test_proto_utils.py
    • Added a new module with unit tests for proto_utils.py, covering _apply_transforms, proto_to_dict, from_proto, and pydantic_to_proto.
  • grpc_servicer/tests/test_render_servicer.py
    • Added a new module with unit tests for render_servicer.py, covering all RPC methods of RenderGrpcServicer.
Activity
  • The author, hyeongyun0916, has implemented a new gRPC service for GPU-less rendering in vLLM.
  • The pull request includes 51 new unit tests which have passed, ensuring the correctness of the new protobuf utilities, field transformations, and the RenderGrpcServicer.
  • The protobuf definitions were successfully compiled, confirming the integrity of the new .proto file.
  • The author has performed cargo +nightly fmt and cargo clippy checks, indicating adherence to code style and quality standards.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@github-actions github-actions bot added the dependencies (Dependency updates) and tests (Test changes) labels on Mar 17, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new VllmRender gRPC service for GPU-less rendering. The implementation is well-structured, separating concerns into protobuf definitions, conversion utilities, and the service logic. The code is clean, robustly handles errors, and is accompanied by a comprehensive test suite. I have one minor suggestion to improve code clarity in the servicer implementation by removing some unreachable code. Overall, this is an excellent contribution.

grpc.StatusCode.UNIMPLEMENTED,
"RenderChat is not configured on this server.",
)
return


medium

The return statement here is unreachable because context.abort() raises a grpc.aio.AbortError exception, which terminates the method's execution. Removing this unreachable code improves clarity.

This same pattern of an unreachable return after context.abort() also occurs on lines 77, 95, and 109. All of them can be removed.
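A minimal stand-in showing why that return is unreachable; FakeContext and AbortError are hypothetical stand-ins, but grpc.aio's real context.abort likewise raises rather than returning.

```python
import asyncio

class AbortError(Exception):
    pass

class FakeContext:
    async def abort(self, code, details):
        # mirrors grpc.aio: abort terminates the RPC by raising
        raise AbortError(details)

async def render_chat(ctx):
    await ctx.abort("UNIMPLEMENTED", "RenderChat is not configured on this server.")
    return "unreachable"  # dead code: abort() has already raised

try:
    asyncio.run(render_chat(FakeContext()))
    reached = True
except AbortError as e:
    reached = False
    detail = str(e)
```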

Author: applied in 949d979


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a7c5574205

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +57 to +61
FIELD_TRANSFORMS: dict[str, tuple[str, Any]] = {
"parameters_json": ("parameters", json.loads),
"content_parts": ("content", None),
"prompt": ("prompt", flatten_completion_prompt),
"messages": ("messages", _ensure_message_content),


P1: Parse structured tool_choice before building ChatCompletionRequest

RenderChatRequest documents that tool_choice may be JSON ("none"/"auto"/"required" or JSON), but the transform table never deserializes that field, so from_proto(...) forwards raw JSON text into ChatCompletionRequest instead of an object. Requests that force a specific tool (for example a named function choice) will therefore be interpreted as a plain string and fail validation or behave incorrectly, which breaks tool-calling render flows.

Useful? React with 👍 / 👎.

Author: applied in 0f36685

Comment on lines +86 to +87
except Exception as e:
await context.abort(grpc.StatusCode.INTERNAL, str(e))


P2: Map request-decoding failures to INVALID_ARGUMENT

The broad except Exception path converts all parsing/validation failures into INTERNAL, so malformed client payloads (for example bad JSON in parameters_json during proto→dict transforms) are reported as server faults instead of request errors. This misclassifies user input bugs as 500s, can trigger unnecessary retries/alerts, and makes debugging client-side request issues harder.
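A sketch of the suggested classification; the classify helper and the exact exception set chosen here are assumptions for illustration.

```python
import json

def classify(exc):
    # client-payload parse/validation errors are request errors, not server
    # faults; everything else stays INTERNAL
    if isinstance(exc, (ValueError, TypeError)):  # JSONDecodeError is a ValueError
        return "INVALID_ARGUMENT"
    return "INTERNAL"

try:
    json.loads("{not json")  # e.g. bad parameters_json from a client
except Exception as e:
    code = classify(e)
```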

Useful? React with 👍 / 👎.

Author: applied in d86106a


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d86106a2db


Comment on lines +40 to +44
return vllm_engine_pb2.GetModelInfoResponse(
model_path=model_config.model,
is_generation=model_config.runner_type == "generate",
max_context_length=model_config.max_model_len,
vocab_size=model_config.get_vocab_size(),


P2: Populate served_model_name in render model info

GetModelInfo builds a vllm_engine_pb2.GetModelInfoResponse but never sets served_model_name, so this RPC always reports an empty alias even when the server is configured with a custom served model name. In deployments where served_model_name differs from model_path, downstream discovery/routing that prioritizes this label (for example model ID selection in model_gateway/src/core/steps/worker/local/create_worker.rs) can pick the wrong identifier and misroute traffic; this should mirror the engine servicer behavior by filling served_model_name from model config.


Author: Already addressed in 9cd6eb2 — GetModelInfo now sets served_model_name from model_config.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/grpc_client/proto/vllm_render.proto`:
- Around line 43-49: The ContentPart message currently allows multiple payload
fields to be set simultaneously; change its definition to enforce exclusivity by
wrapping text, image_url, input_audio, and video_url inside a oneof (e.g., oneof
payload) so only one variant can be present at a time; update the ContentPart
message (and any generated/consuming code expectations) to use the oneof payload
for the fields referenced as text, image_url (ImageUrlContent), input_audio
(InputAudioContent), and video_url (VideoUrlContent) to match the Rust enum
semantics.

In `@grpc_servicer/pyproject.toml`:
- Around line 33-35: The dev extra is missing the vllm dependency which causes
pip install -e .[dev] to fail because tests import smg_grpc_servicer.vllm.*
(top-level imports from vllm like vllm.logger, vllm.outputs); update the dev
extra (the "dev" entry in pyproject.toml) to include vllm (e.g., add
"vllm>=0.16.0" to the list or reference the vllm extra via ".[vllm]") so
installing the dev extras pulls in vllm.

In `@grpc_servicer/smg_grpc_servicer/vllm/field_transforms.py`:
- Line 41: Replace the silent "return None" in the CompletionPrompt handling
code with an explicit ValueError so malformed prompt dicts fail fast; locate the
branch that checks/handles CompletionPrompt shapes in field_transforms.py (the
code that currently returns None for unknown prompt dicts) and raise
ValueError("Unsupported CompletionPrompt shape") (or a similarly descriptive
message including the offending value) instead of returning None so the caller
can map it to an INVALID_ARGUMENT error.

In `@grpc_servicer/smg_grpc_servicer/vllm/proto_utils.py`:
- Around line 47-50: The pydantic_to_proto function currently calls
ParseDict(..., ignore_unknown_fields=True) which silently drops unknown fields;
change it to fail-fast or explicitly whitelist fields: either remove
ignore_unknown_fields=True so ParseDict raises on unknown keys, or derive an
allowlist from the target proto (e.g., use message_class.DESCRIPTOR.fields to
get allowed field names) and filter the dict returned by
model.model_dump(mode="json", exclude_none=True) to only those keys before
calling ParseDict; reference the pydantic_to_proto function, the message_class
parameter, and ParseDict when making the change.

In `@grpc_servicer/smg_grpc_servicer/vllm/render_servicer.py`:
- Around line 40-46: The GetModelInfoResponse / GetServerInfoResponse currently
rely on proto defaults for shared fields; explicitly set served_model_name,
active_requests, is_paused, kv_connector, and kv_role when constructing
responses in render_servicer.py (the GetModelInfoResponse return and the
analogous GetServerInfoResponse around lines 49-53) so consumers aren’t left
with ambiguous defaults—use the appropriate values from model_config or server
state (e.g., served model identifier from model_config, current active request
count, paused state flag, and KV connector/role info) and fall back to explicit
zero/empty values only if the source is absent, then run the
request_verification mentioned to ensure no consumer expects implicit defaults.
- Around line 86-87: Replace the direct exposure of internal exception text in
the except blocks that call await context.abort(grpc.StatusCode.INTERNAL,
str(e)) (occurrences around the context.abort calls at lines referenced) by
logging the full exception server-side (use logger.exception(...) or create
module logger = logging.getLogger(__name__) and call
logger.exception("render_servicer error")) and then aborting with a generic
message such as await context.abort(grpc.StatusCode.INTERNAL, "Internal server
error"); update both places that use str(e) (the except blocks referencing
variable e) to follow this pattern.
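The last point can be sketched like this; FakeContext stands in for grpc.aio's ServicerContext, and the generic message wording follows the comment above.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

class FakeContext:
    # hypothetical stand-in for grpc.aio's ServicerContext
    def __init__(self):
        self.aborted_with = None
    async def abort(self, code, details):
        self.aborted_with = (code, details)
        raise RuntimeError(details)  # grpc.aio abort raises to end the RPC

async def handler(ctx):
    try:
        raise KeyError("secret internal detail")
    except Exception:
        # full traceback stays server-side; client sees a generic message
        logger.exception("render_servicer error")
        await ctx.abort("INTERNAL", "Internal server error")

ctx = FakeContext()
try:
    asyncio.run(handler(ctx))
except RuntimeError:
    pass
```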

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4f4cb66a-2f59-41a7-b57f-d246aa9f1b0f

📥 Commits

Reviewing files that changed from the base of the PR and between 3282dc1 and d86106a.

📒 Files selected for processing (13)
  • crates/grpc_client/proto/vllm_render.proto
  • crates/grpc_client/python/pyproject.toml
  • crates/grpc_client/python/smg_grpc_proto/__init__.py
  • grpc_servicer/pyproject.toml
  • grpc_servicer/smg_grpc_servicer/vllm/__init__.py
  • grpc_servicer/smg_grpc_servicer/vllm/field_transforms.py
  • grpc_servicer/smg_grpc_servicer/vllm/proto_utils.py
  • grpc_servicer/smg_grpc_servicer/vllm/render_servicer.py
  • grpc_servicer/tests/__init__.py
  • grpc_servicer/tests/conftest.py
  • grpc_servicer/tests/test_field_transforms.py
  • grpc_servicer/tests/test_proto_utils.py
  • grpc_servicer/tests/test_render_servicer.py


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/grpc_client/proto/vllm_render.proto`:
- Around line 119-121: The TokenIdSequence.message and GenerateRequestProto use
different scalar types for token IDs (TokenIdSequence.token_ids is uint32 while
GenerateRequestProto.token_ids is int32); pick one consistent scalar (either
change TokenIdSequence.token_ids to int32 or change
GenerateRequestProto.token_ids to uint32), update both message definitions to
the chosen type, and then regenerate protobuf artifacts and update any code that
relies on TokenIdSequence or GenerateRequestProto to the unified type to avoid
type mismatches.

In `@grpc_servicer/smg_grpc_servicer/vllm/proto_utils.py`:
- Line 20: The FieldTransforms type alias currently uses Any for the transform
which weakens static checking; update FieldTransforms to use a callable type
such as Callable[[Any], Any] | None (e.g. FieldTransforms = dict[str, tuple[str,
Callable[[Any], Any] | None]]) and add the necessary import for Callable from
typing so editors and type-checkers can validate transform functions used by the
code.

In `@grpc_servicer/smg_grpc_servicer/vllm/render_servicer.py`:
- Around line 9-24: Move the module-level logger initialization so all imports
are grouped above it: relocate the line defining logger =
logging.getLogger(__name__) to after the import block (after the last import,
e.g. after the import of ErrorResponse) so no imports are interleaved with
module-level code in render_servicer.py; ensure any code relying on the logger
still references the same symbol name (logger).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 401e2da5-518d-4628-8277-b476a83f9857

📥 Commits

Reviewing files that changed from the base of the PR and between d86106a and 9ee771a.

📒 Files selected for processing (9)
  • crates/grpc_client/proto/vllm_render.proto
  • grpc_servicer/pyproject.toml
  • grpc_servicer/smg_grpc_servicer/vllm/field_transforms.py
  • grpc_servicer/smg_grpc_servicer/vllm/proto_utils.py
  • grpc_servicer/smg_grpc_servicer/vllm/render_servicer.py
  • grpc_servicer/tests/conftest.py
  • grpc_servicer/tests/test_field_transforms.py
  • grpc_servicer/tests/test_proto_utils.py
  • grpc_servicer/tests/test_render_servicer.py


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9ee771ad64


repeated ChatCompletionTool tools = 3;
optional string tool_choice = 4; // "none"/"auto"/"required" or JSON
optional string chat_template = 5;
bool add_generation_prompt = 6;


P1: Make render bool overrides presence-aware

RenderChatRequest.add_generation_prompt (and similarly RenderCompletionRequest.add_special_tokens) is defined as a plain proto3 bool, but request decoding goes through MessageToDict in from_proto, which drops false-valued scalar fields without presence. That means a client explicitly sending false cannot be distinguished from “unset”, so downstream request defaults are used instead of the caller’s override; in render flows this can flip prompt templating/tokenization behavior for valid inputs that require disabling these flags.


Author: Changed add_generation_prompt, continue_final_message, and add_special_tokens to optional bool for presence awareness.

Comment on lines +60 to +61
new_key, fn = transforms[key]
result[new_key] = fn(value) if fn else value


P2: Reject transformed-key collisions during proto mapping

_apply_transforms silently overwrites existing keys when a rename targets a key already present (for example content_parts -> content while content is also set in the same chat message). Because RenderChatRequest allows both fields (not a oneof), contradictory payloads are currently accepted with lossy, order-dependent data selection instead of returning INVALID_ARGUMENT, which can corrupt rendered prompts in malformed-but-possible client requests.
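A possible guard for the quoted rename step, assuming the fix raises ValueError so the servicer can map it to INVALID_ARGUMENT; apply_rename is a hypothetical extraction of the quoted lines, not the actual code.

```python
def apply_rename(result, transforms, key, value):
    new_key, fn = transforms[key]
    if new_key in result:
        # fail fast instead of silently overwriting existing data, so the
        # caller can surface grpc.StatusCode.INVALID_ARGUMENT
        raise ValueError(f"conflicting fields: both {key!r} and {new_key!r} are set")
    result[new_key] = fn(value) if fn else value

transforms = {"content_parts": ("content", None)}
result = {"content": "already set"}
try:
    apply_rename(result, transforms, "content_parts", [{"type": "text", "text": "hi"}])
    err = None
except ValueError as e:
    err = str(e)
```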


Author


Current FIELD_TRANSFORMS mappings don't have source/target key overlap with proto field names, so collision can't occur. Will add a guard if the transform table grows.
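If the transform table does grow, a guard along these lines would turn the lossy overwrite into an explicit error (hypothetical sketch; the `(new_key, fn)` tuple shape follows the snippet quoted above):

```python
def apply_transforms(data: dict, transforms: dict) -> dict:
    """Rename/convert proto-decoded keys, rejecting key collisions.

    `transforms` maps a source key to (target key, optional converter),
    mirroring the `new_key, fn = transforms[key]` shape quoted above.
    """
    result = {}
    for key, value in data.items():
        if key in transforms:
            new_key, fn = transforms[key]
            # Reject contradictory payloads instead of silently overwriting,
            # e.g. both `content` and `content_parts` set on one message.
            if new_key in result or (new_key != key and new_key in data):
                raise ValueError(f"conflicting fields both map to '{new_key}'")
            result[new_key] = fn(value) if fn else value
        else:
            if key in result:
                raise ValueError(f"conflicting fields both map to '{key}'")
            result[key] = value
    return result
```

A servicer catching this `ValueError` could then surface it as INVALID_ARGUMENT rather than accepting order-dependent data.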

…tProto

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
…Request to optional

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ace448ac24


if "text" in prompt_dict:
    return prompt_dict["text"]
if "texts" in prompt_dict:
    return list(prompt_dict["texts"]["texts"])


P2: Guard empty completion prompt branches before indexing

flatten_completion_prompt assumes nested keys like texts.texts always exist, but MessageToDict drops empty repeated fields, so a request that selects the texts oneof with an empty list can deserialize as {"texts": {}} and trigger KeyError here. That exception is then surfaced as INTERNAL by RenderCompletion (it only maps ValueError/TypeError), misclassifying a client-side payload issue and breaking empty-prompt edge cases.


Author


Guarded nested key access in flatten_completion_prompt with .get() to handle empty repeated fields dropped by MessageToDict.
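The guarded access might look like the following sketch (illustrative only; the remaining prompt oneof branches, e.g. token IDs, are elided):

```python
def flatten_completion_prompt(prompt_dict: dict):
    """Sketch of the guarded branches (other oneof cases omitted).

    MessageToDict drops empty repeated fields, so a request that selects
    the `texts` oneof with an empty list can arrive here as {"texts": {}}.
    """
    if "text" in prompt_dict:
        return prompt_dict["text"]
    if "texts" in prompt_dict:
        # .get() tolerates the dropped inner list: an empty prompt batch
        # becomes [] instead of raising KeyError (surfaced as INTERNAL).
        return list(prompt_dict["texts"].get("texts", []))
    raise ValueError("no supported oneof field set")
```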


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/grpc_client/proto/vllm_render.proto`:
- Around line 82-85: Wrap content and content_parts in a proto oneof so they are
mutually exclusive: in ChatCompletionMessage inside
crates/grpc_client/proto/vllm_render.proto replace the two standalone fields
content and content_parts with a oneof (e.g. oneof content_oneof { string
content = 2; repeated ContentPart content_parts = 3; }) so gRPC/protobuf
enforces exclusivity; after regenerating protos, update any code paths that
reference the fields (notably ChatCompletionMessage handling and the remapping
logic in grpc_servicer/smg_grpc_servicer/vllm/field_transforms.py) to read from
the oneof (check which oneof case is set) before mapping to
ChatCompletionRequest.
- Around line 1-20: The build script currently only compiles common.proto,
sglang_scheduler.proto, vllm_engine.proto, and trtllm_service.proto, so the
VllmRender service and its messages (VllmRender, RenderChatRequest/Response,
RenderCompletionRequest/Response) are not being generated; update
crates/grpc_client/build.rs to include "proto/vllm_render.proto" in the list
passed to tonic_build/prost_build (or whichever proto compiler is used), ensure
the include path that contains vllm_render.proto is present, and re-run the
build to generate bindings for the VllmRender service and its RPCs (RenderChat
and RenderCompletion).

In `@grpc_servicer/smg_grpc_servicer/vllm/render_servicer.py`:
- Around line 70-96: The ValueError/TypeError handler currently wraps both input
parsing and response serialization in RenderChat, causing server-side
serialization errors to return INVALID_ARGUMENT; refactor the handler so input
validation (from_proto and render_chat_request call) is inside a narrow try that
catches ValueError/TypeError and aborts with INVALID_ARGUMENT, then perform
pydantic_to_proto(...) and response assembly in a separate block so any
serialization errors bubble to the broader Exception handler (which logs and
aborts with INTERNAL); apply the same split to the RenderCompletion handler and
preserve the existing grpc.aio.AbortError re-raise behavior and ErrorResponse
handling (references: from_proto, render_chat_request, pydantic_to_proto,
RenderChat and RenderCompletion handlers).
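Stripped of the gRPC machinery, the suggested split can be sketched with plain callables (all names hypothetical; in the real servicer the second block's failures would reach the broad `Exception` handler and abort with INTERNAL):

```python
# Only input parsing/validation sits in the narrow ValueError/TypeError
# handler, so serialization failures are classified as server-side errors
# (INTERNAL) rather than client-side ones (INVALID_ARGUMENT).
def handle_render(parse, render, serialize, payload):
    try:
        request = parse(payload)           # client input: narrow try
        result = render(request)
    except (ValueError, TypeError) as exc:
        return ("INVALID_ARGUMENT", str(exc))
    try:
        return ("OK", serialize(result))   # server-side serialization
    except Exception as exc:               # stand-in for the broad handler
        return ("INTERNAL", str(exc))
```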


📥 Commits

Reviewing files that changed from the base of the PR and between 9ee771a and ace448a.

📒 Files selected for processing (2)
  • crates/grpc_client/proto/vllm_render.proto
  • grpc_servicer/smg_grpc_servicer/vllm/render_servicer.py

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
…ilures in RenderChat and RenderCompletion

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
…ty fields and batches

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

mergify bot commented Mar 17, 2026

Hi @hyeongyun0916, this PR has merge conflicts that must be resolved before it can be merged. Please rebase your branch:

git fetch origin main
git rebase origin/main
# resolve any conflicts, then:
git push --force-with-lease

@mergify mergify bot added the needs-rebase PR has merge conflicts that need to be resolved label Mar 17, 2026
…m-render

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
@mergify mergify bot removed the needs-rebase PR has merge conflicts that need to be resolved label Mar 18, 2026
@slin1237
Collaborator

@hyeongyun0916 i get what this is trying to do
however,
the whole tokenization, apply-chat-template, function call, and reasoning handling is already done in SMG
the whole idea of SMG is to disaggregate serving and engine
and the communication between smg and engine is done via rpc
if i understood the RFC properly, everything is still handled in python?
im a bit lost on the duplicated effort; smg can easily expose an early return of the entire chat pipeline result before sending to the model
should we do that instead?

@slin1237
Collaborator

reading more into the design, i think i have a better understanding now
the goal is to have vllm expose an endpoint that handles everything in the chat completion pipeline except actual generation
this includes

  • pre-processing (apply chat template, tokenization)
  • post-processing (detokenization, reasoning parsing, function call parsing)

im assuming the goal is to reduce cpu overhead as much as possible, and this gpu-less vllm can scale horizontally?
note that this adds a lot of network overhead and development overhead
i think, if we really wanna do that, smg should expose each component as a stand-alone crate

  1. tokenizer (including tokenization caching, detokenization, stop sequence decoder, etc.)
  2. reasoning parser
  3. function call parser

each one is a stand-alone component pushed to crates.io, and it's fairly easy to create a python binding for each
now, it's still not trivial to put them into a pipeline
smg itself communicates with the engine via rpc, and it's pure tokens out and tokens back; the whole pipeline is built like lego, composing the crates
if we truly want a component that doesn't handle generation, just allow the pipeline to return early
and it's all rust, much more scalable than python
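The early-return idea above can be sketched in a few lines of Python (a toy illustration only, not SMG's or vLLM's actual pipeline; all names are made up):

```python
# Toy sketch: one pipeline serves both full generation and a render-only
# mode that returns early, right after pre-processing.
def apply_chat_template(messages):
    # Stand-in for a real chat template.
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)

def chat_pipeline(messages, tokenize, generate=None):
    prompt = apply_chat_template(messages)   # pre-processing
    token_ids = tokenize(prompt)
    if generate is None:                     # render-only mode: early return
        return {"prompt": prompt, "token_ids": token_ids}
    return generate(token_ids)               # full pipeline

rendered = chat_pipeline(
    [{"role": "user", "content": "hi"}],
    tokenize=lambda s: [ord(c) for c in s],  # stand-in tokenizer
)
```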

@hyeongyun0916
Author

hyeongyun0916 commented Mar 18, 2026


Thanks for the detailed suggestion!

One concern: vLLM uses custom tokenizer implementations for several models (e.g., deepseek_v32, mistral, grok2) that bypass the standard Jinja2 template. For DeepSeek V3.2 specifically, tokenizer_config.json has no chat_template at all.

I'm not sure how many of these cases SMG currently covers, and whether the resulting token_ids would match vLLM's output for all of them.

@slin1237
Collaborator

thanks for the reply
SMG supports dpsk3.2, customized tokenizers, and customized chat templates; depending on how the chat template is defined (config, its own file, etc.), it supports them all
SMG recently added tiktoken tokenizer support, since kimi k2 uses that, and it has the same level of support as a regular hf tokenizer in smg

…ple files

Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@grpc_servicer/smg_grpc_servicer/vllm/render_servicer.py`:
- Around line 98-103: The RenderCompletion method may continue executing after
calling context.abort(), so add an explicit return immediately after the await
context.abort(...) call inside RenderCompletion to ensure the method exits when
self.state.openai_serving_render is None; update the RenderCompletion function
(reference: RenderCompletion, context.abort, self.state.openai_serving_render)
to place a return following the abort invocation.
- Around line 65-70: In RenderChat, after calling await context.abort(...) add
an explicit return to make control flow clear and satisfy linters; specifically,
update the async method RenderChat so that immediately after the abort call you
return (so subsequent references like result on RenderChat's later code path are
not flagged as potentially uninitialized).

In `@grpc_servicer/tests/test_field_transforms.py`:
- Around line 51-61: Move the inline imports of pytest out of the test functions
into a single module-level import at the top of the file; specifically remove
the "import pytest" lines inside test_empty_dict_raises_value_error and
test_unknown_keys_raises_value_error and add a single "import pytest" at the
file scope so both functions (and any other tests) use the module-level pytest
import.


📥 Commits

Reviewing files that changed from the base of the PR and between eca0216 and 3e07200.

📒 Files selected for processing (5)
  • grpc_servicer/smg_grpc_servicer/vllm/field_transforms.py
  • grpc_servicer/smg_grpc_servicer/vllm/render_servicer.py
  • grpc_servicer/tests/test_field_transforms.py
  • grpc_servicer/tests/test_proto_utils.py
  • grpc_servicer/tests/test_render_servicer.py

Comment on lines +65 to +70
    async def RenderChat(self, request, context):
        if self.state.openai_serving_render is None:
            await context.abort(
                grpc.StatusCode.UNIMPLEMENTED,
                "RenderChat is not configured on this server.",
            )

🧹 Nitpick | 🔵 Trivial

Add explicit return after context.abort() for clarity.

context.abort() raises grpc.aio.AbortError internally, so the code after it won't execute. However, adding an explicit return makes the control flow clearer and prevents linter warnings about potentially unbound variables (result on line 92).

♻️ Suggested fix
     async def RenderChat(self, request, context):
         if self.state.openai_serving_render is None:
             await context.abort(
                 grpc.StatusCode.UNIMPLEMENTED,
                 "RenderChat is not configured on this server.",
             )
+            return  # abort raises, but explicit return aids readability

Comment on lines +98 to +103
    async def RenderCompletion(self, request, context):
        if self.state.openai_serving_render is None:
            await context.abort(
                grpc.StatusCode.UNIMPLEMENTED,
                "RenderCompletion is not configured on this server.",
            )

🧹 Nitpick | 🔵 Trivial

Same suggestion: add explicit return after context.abort().

Same clarity improvement applies to RenderCompletion.

♻️ Suggested fix
     async def RenderCompletion(self, request, context):
         if self.state.openai_serving_render is None:
             await context.abort(
                 grpc.StatusCode.UNIMPLEMENTED,
                 "RenderCompletion is not configured on this server.",
             )
+            return  # abort raises, but explicit return aids readability

Comment on lines +51 to +61
    def test_empty_dict_raises_value_error(self):
        import pytest

        with pytest.raises(ValueError, match="no supported oneof field set"):
            flatten_completion_prompt({})

    def test_unknown_keys_raises_value_error(self):
        import pytest

        with pytest.raises(ValueError, match="no supported oneof field set"):
            flatten_completion_prompt({"unknown": "value"})

🧹 Nitpick | 🔵 Trivial

Move pytest import to module level.

pytest is imported inline within test methods (lines 52 and 58). Since pytest is already a test dependency and used elsewhere, move the import to the top of the file for consistency.

♻️ Suggested fix
 from smg_grpc_servicer.vllm.field_transforms import (
     FIELD_TRANSFORMS,
     _ensure_message_content,
     _parse_tool_choice,
     flatten_completion_prompt,
 )
+import pytest

Then remove the inline imports:

     def test_empty_dict_raises_value_error(self):
-        import pytest
-
         with pytest.raises(ValueError, match="no supported oneof field set"):
             flatten_completion_prompt({})

     def test_unknown_keys_raises_value_error(self):
-        import pytest
-
         with pytest.raises(ValueError, match="no supported oneof field set"):
             flatten_completion_prompt({"unknown": "value"})


Labels

dependencies (Dependency updates), tests (Test changes)
