[data][llm] Allow tokenized_prompt without prompt in vLLMEngineStage #59801
Conversation
Code Review
This pull request correctly updates the validation logic in vLLMEngineStage to allow either prompt or tokenized_prompt as input, which aligns with the behavior of vLLMEngineProcessor. The changes are logical and are accompanied by good test coverage, including both positive and negative test cases.
I have one suggestion to refactor the validate_inputs method in vllm_engine_stage.py to improve maintainability by avoiding temporary state modification. Otherwise, the changes look good.
def validate_inputs(self, inputs: List[Dict[str, Any]]):
    """Validate the inputs to make sure the required keys are present.

    Overrides base class to handle the requirement for prompt/tokenized_prompt.

    Args:
        inputs: The inputs.

    Raises:
        ValueError: If the required keys are not found.
    """
    for inp in inputs:
        input_keys = set(inp.keys())

        if "prompt" not in input_keys and "tokenized_prompt" not in input_keys:
            raise ValueError(
                "Either 'prompt' (text) or 'tokenized_prompt' (tokens) "
                f"must be provided. Input keys: {input_keys}"
            )

    original_expected_keys = self.expected_input_keys.copy()
    self.expected_input_keys = self.expected_input_keys - {"prompt", "tokenized_prompt"}

    try:
        super().validate_inputs(inputs)
    finally:
        self.expected_input_keys = original_expected_keys
The current implementation of validate_inputs temporarily modifies the instance state (self.expected_input_keys). While this works and is protected by a try...finally block, it can be fragile and is generally not a good practice as it can lead to subtle bugs, especially if the class is used in a concurrent environment in the future.
A cleaner approach would be to reimplement the validation logic without modifying instance state. This would involve checking for prompt or tokenized_prompt and then checking for the other expected keys, similar to what the superclass does. This makes the method self-contained and easier to reason about.
def validate_inputs(self, inputs: List[Dict[str, Any]]):
    """Validate the inputs to make sure the required keys are present.

    Overrides base class to handle the requirement for prompt/tokenized_prompt.

    Args:
        inputs: The inputs.

    Raises:
        ValueError: If the required keys are not found.
    """
    # All expected keys except for prompt/tokenized_prompt, which are handled specially.
    other_expected_keys = self.expected_input_keys - {"prompt", "tokenized_prompt"}
    for inp in inputs:
        input_keys = set(inp.keys())
        if self.IDX_IN_BATCH_COLUMN in input_keys:
            raise ValueError(
                f"The input column {self.IDX_IN_BATCH_COLUMN} is reserved "
                "for internal use."
            )
        if "prompt" not in input_keys and "tokenized_prompt" not in input_keys:
            raise ValueError(
                "Either 'prompt' (text) or 'tokenized_prompt' (tokens) "
                f"must be provided. Input keys: {input_keys}"
            )
        # Check for other required keys.
        missing_required = other_expected_keys - input_keys
        if missing_required:
            raise ValueError(
                f"Required input keys {missing_required} not found at the input of "
                f"{self.__class__.__name__}. Input keys: {input_keys}"
            )

ret = {
    "prompt": "The text prompt (str). Required if tokenized_prompt is not provided. Either prompt or tokenized_prompt must be provided.",
    "tokenized_prompt": "The tokenized prompt. Required if prompt is not provided. Either prompt or tokenized_prompt must be provided.",
}
Open to suggestions -- at least one of prompt or tokenized_prompt should be provided. Is marking both as required the right heuristic?
Force-pushed from 0ee6679 to 0b71f8f
| if "prompt" not in input_keys and "tokenized_prompt" not in input_keys: | ||
| raise ValueError( | ||
| "Either 'prompt' (text) or 'tokenized_prompt' (tokens) " | ||
| f"must be provided. Input keys: {input_keys}" | ||
| ) | ||
|
|
||
| original_expected_keys = self.expected_input_keys.copy() | ||
| self.expected_input_keys = self.expected_input_keys - { | ||
| "prompt", | ||
| "tokenized_prompt", | ||
| } | ||
|
|
||
| try: | ||
| super().validate_inputs(inputs) |
What happens if we don't validate here and just let the engine fail? What would the error message coming from the engine look like? It won't fail silently, right?
The reason I'm interested in not validating like this at all is that we're adding a lot of extra complexity for little gain on input-key validation, in a case where we want an either/or expectation.
In fact, what we are validating here is that prompt must exist if tokenized_prompt is not present. From the assertion, users may be confused into thinking that prompt must always be provided. Although it's not immediately clear that only one of prompt or tokenized_prompt should be provided, I agree that the additional validation complexity isn't worthwhile.
Note: Even if the existing validation is removed, the engine will raise vllm.v1.engine.exceptions.EngineGenerateError.
Let's put both of them under get_optional_input_keys then.
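For reference, moving both keys out of the required set could look roughly like the sketch below. This is not the code from this PR: the hook shape (a get_optional_input_keys method returning a key-to-description mapping, mirroring the ret dict quoted above) is an assumption based on this discussion.

```python
from typing import Dict


# Sketch only: assumes vLLMEngineStage exposes a get_optional_input_keys()
# hook that mirrors get_required_input_keys() and returns key -> description.
def get_optional_input_keys(self) -> Dict[str, str]:
    """Optional inputs; at least one of prompt/tokenized_prompt must be set."""
    return {
        "prompt": "The text prompt (str). Ignored if tokenized_prompt is provided.",
        "tokenized_prompt": "The tokenized prompt (list of token ids). "
        "Preferred over prompt when both are present.",
    }
```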
Force-pushed from 59a4791 to 88ef46a
Force-pushed from 88ef46a to 65e4d2c
@root_validator(pre=True)
def validate_prompt_or_prompt_token_ids(cls, values):
    if not values.get("prompt") and not values.get("prompt_token_ids"):
        raise ValueError("Either 'prompt' or 'prompt_token_ids' must be provided.")
Error message uses internal field name instead of user-facing name
Low Severity
The error message says "Either 'prompt' or 'prompt_token_ids' must be provided" but users provide data using the field name tokenized_prompt, not prompt_token_ids. The _prepare_llm_request method maps the user-facing tokenized_prompt field to the internal prompt_token_ids field when creating vLLMEngineRequest. This mismatch between the error message and the user-facing API could confuse users who receive this validation error.
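One possible fix, keeping the validator on the request model but phrasing the error in terms of the user-facing column, is sketched below. The model here is a hypothetical stand-in for vLLMEngineRequest, and the pydantic v1-style root_validator simply follows the diff above.

```python
from typing import List, Optional

from pydantic import BaseModel, root_validator  # pydantic v1-style API, as in the diff


class vLLMEngineRequestSketch(BaseModel):
    """Hypothetical stand-in for vLLMEngineRequest, for illustration only."""

    prompt: Optional[str] = None
    prompt_token_ids: Optional[List[int]] = None

    @root_validator(pre=True)
    def validate_prompt_or_prompt_token_ids(cls, values):
        if not values.get("prompt") and not values.get("prompt_token_ids"):
            # Report the user-facing column name rather than the internal field.
            raise ValueError(
                "Either 'prompt' or 'tokenized_prompt' must be provided."
            )
        return values
```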
kouroshHakha left a comment
code looks good.
Looking into premerge failures 🤔
Force-pushed from 679dd22 to 520e193
Description

When using vLLMEngineProcessor, vLLMEngineStage only requires either `prompt` or `tokenized_prompt` to be present. However, the current implementation raises a validation error when `prompt` is missing.

This PR updates the validation logic in vLLMEngineStage to allow inputs where the `prompt` column is absent, as long as `tokenized_prompt` is provided. If both `prompt` and `tokenized_prompt` are present, the existing behavior is preserved: `tokenized_prompt` is preferred and the tokens are passed directly to the engine.

Related issues

Additional information
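For illustration, a minimal usage sketch of feeding pre-tokenized rows through `vLLMEngineProcessor` (not part of this PR; the model name, token IDs, sampling params, and config values are placeholders):

```python
import ray
from ray.data.llm import build_llm_processor, vLLMEngineProcessorConfig

# Placeholder config; any vLLM-compatible model and sizing would do.
config = vLLMEngineProcessorConfig(
    model_source="facebook/opt-125m",
    concurrency=1,
    batch_size=16,
)

processor = build_llm_processor(
    config,
    # Emit only tokenized_prompt; with this change, no "prompt" column is required.
    preprocess=lambda row: dict(
        tokenized_prompt=row["token_ids"],
        sampling_params=dict(max_tokens=32),
    ),
    postprocess=lambda row: dict(generated_text=row["generated_text"]),
)

ds = ray.data.from_items([{"token_ids": [1, 2, 3, 4]}])
ds = processor(ds)
ds.show()
```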