[data][llm] Allow tokenized_prompt without prompt in vLLMEngineStage #59801
Conversation
Code Review
This pull request correctly updates the validation logic in vLLMEngineStage to allow either prompt or tokenized_prompt as input, which aligns with the behavior of vLLMEngineProcessor. The changes are logical and are accompanied by good test coverage, including both positive and negative test cases.
I have one suggestion to refactor the validate_inputs method in vllm_engine_stage.py to improve maintainability by avoiding temporary state modification. Otherwise, the changes look good.
def validate_inputs(self, inputs: List[Dict[str, Any]]):
    """Validate the inputs to make sure the required keys are present.

    Overrides base class to handle the requirement for prompt/tokenized_prompt.

    Args:
        inputs: The inputs.

    Raises:
        ValueError: If the required keys are not found.
    """
    for inp in inputs:
        input_keys = set(inp.keys())

        if "prompt" not in input_keys and "tokenized_prompt" not in input_keys:
            raise ValueError(
                "Either 'prompt' (text) or 'tokenized_prompt' (tokens) "
                f"must be provided. Input keys: {input_keys}"
            )

    original_expected_keys = self.expected_input_keys.copy()
    self.expected_input_keys = self.expected_input_keys - {"prompt", "tokenized_prompt"}

    try:
        super().validate_inputs(inputs)
    finally:
        self.expected_input_keys = original_expected_keys
The current implementation of validate_inputs temporarily modifies the instance state (self.expected_input_keys). While this works and is protected by a try...finally block, it can be fragile and is generally not a good practice as it can lead to subtle bugs, especially if the class is used in a concurrent environment in the future.
A cleaner approach would be to reimplement the validation logic without modifying instance state. This would involve checking for prompt or tokenized_prompt and then checking for the other expected keys, similar to what the superclass does. This makes the method self-contained and easier to reason about.
def validate_inputs(self, inputs: List[Dict[str, Any]]):
    """Validate the inputs to make sure the required keys are present.

    Overrides base class to handle the requirement for prompt/tokenized_prompt.

    Args:
        inputs: The inputs.

    Raises:
        ValueError: If the required keys are not found.
    """
    # All expected keys except for prompt/tokenized_prompt, which are handled specially.
    other_expected_keys = self.expected_input_keys - {"prompt", "tokenized_prompt"}
    for inp in inputs:
        input_keys = set(inp.keys())
        if self.IDX_IN_BATCH_COLUMN in input_keys:
            raise ValueError(
                f"The input column {self.IDX_IN_BATCH_COLUMN} is reserved "
                "for internal use."
            )
        if "prompt" not in input_keys and "tokenized_prompt" not in input_keys:
            raise ValueError(
                "Either 'prompt' (text) or 'tokenized_prompt' (tokens) "
                f"must be provided. Input keys: {input_keys}"
            )
        # Check for other required keys.
        missing_required = other_expected_keys - input_keys
        if missing_required:
            raise ValueError(
                f"Required input keys {missing_required} not found at the input of "
                f"{self.__class__.__name__}. Input keys: {input_keys}"
            )

ret = {
    "prompt": "The text prompt (str). Required if tokenized_prompt is not provided. Either prompt or tokenized_prompt must be provided.",
    "tokenized_prompt": "The tokenized prompt. Required if prompt is not provided. Either prompt or tokenized_prompt must be provided.",
}
Open to suggestions -- at least one of prompt or tokenized_prompt should be provided. Is marking both as required the right heuristic?
Force-pushed from 0ee6679 to 0b71f8f
| if "prompt" not in input_keys and "tokenized_prompt" not in input_keys: | ||
| raise ValueError( | ||
| "Either 'prompt' (text) or 'tokenized_prompt' (tokens) " | ||
| f"must be provided. Input keys: {input_keys}" | ||
| ) | ||
|
|
||
| original_expected_keys = self.expected_input_keys.copy() | ||
| self.expected_input_keys = self.expected_input_keys - { | ||
| "prompt", | ||
| "tokenized_prompt", | ||
| } | ||
|
|
||
| try: | ||
| super().validate_inputs(inputs) |
What happens if we don't validate here and just let the engine fail? What would the error message coming from the engine look like? It won't fail silently, right?
The reason I'm interested in not validating like this at all is that we're adding a lot of extra complexity for little gain on input-key validation, in a case where we want an either/or expectation.
In fact, what we are validating here is that prompt must exist if tokenized_prompt is not present. From the assertion, users may be confused into thinking that prompt must always be provided. Although it's not immediately clear that only one of prompt or tokenized_prompt should be provided, I agree that the additional validation complexity isn't worthwhile.
Note: Even if the existing validation is removed, the engine will raise vllm.v1.engine.exceptions.EngineGenerateError.
Let's put both of them under get_optional_input_keys then.
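For reference, moving both keys out of the required set could look roughly like the sketch below. This is not the code from this PR: the hook shape (a get_optional_input_keys method returning a key-to-description mapping, mirroring the ret dict quoted above) is an assumption based on this discussion.

```python
from typing import Dict


# Sketch only: assumes vLLMEngineStage exposes a get_optional_input_keys()
# hook that mirrors get_required_input_keys() and returns key -> description.
def get_optional_input_keys(self) -> Dict[str, str]:
    """Optional inputs; at least one of prompt/tokenized_prompt must be set."""
    return {
        "prompt": "The text prompt (str). Ignored if tokenized_prompt is provided.",
        "tokenized_prompt": "The tokenized prompt (list of token ids). "
        "Preferred over prompt when both are present.",
    }
```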
Force-pushed from 59a4791 to 88ef46a
Force-pushed from 88ef46a to 65e4d2c
@root_validator(pre=True)
def validate_prompt_or_prompt_token_ids(cls, values):
    if not values.get("prompt") and not values.get("prompt_token_ids"):
        raise ValueError("Either 'prompt' or 'prompt_token_ids' must be provided.")
Error message uses internal field name instead of user-facing name
Low Severity
The error message says "Either 'prompt' or 'prompt_token_ids' must be provided" but users provide data using the field name tokenized_prompt, not prompt_token_ids. The _prepare_llm_request method maps the user-facing tokenized_prompt field to the internal prompt_token_ids field when creating vLLMEngineRequest. This mismatch between the error message and the user-facing API could confuse users who receive this validation error.
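One possible fix, keeping the validator on the request model but phrasing the error in terms of the user-facing column, is sketched below. The model here is a hypothetical stand-in for vLLMEngineRequest, and the pydantic v1-style root_validator simply follows the diff above.

```python
from typing import List, Optional

from pydantic import BaseModel, root_validator  # pydantic v1-style API, as in the diff


class vLLMEngineRequestSketch(BaseModel):
    """Hypothetical stand-in for vLLMEngineRequest, for illustration only."""

    prompt: Optional[str] = None
    prompt_token_ids: Optional[List[int]] = None

    @root_validator(pre=True)
    def validate_prompt_or_prompt_token_ids(cls, values):
        if not values.get("prompt") and not values.get("prompt_token_ids"):
            # Report the user-facing column name rather than the internal field.
            raise ValueError(
                "Either 'prompt' or 'tokenized_prompt' must be provided."
            )
        return values
```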
kouroshHakha left a comment
code looks good.
Looking into premerge failures 🤔
Force-pushed from 679dd22 to 520e193
Description

When using vLLMEngineProcessor, vLLMEngineStage only requires either `prompt` or `tokenized_prompt` to be present. However, the current implementation raises a validation error when `prompt` is missing.

This PR updates the validation logic in vLLMEngineStage to allow inputs where the `prompt` column is absent, as long as `tokenized_prompt` is provided. If both `prompt` and `tokenized_prompt` are present, the existing behavior is preserved: `tokenized_prompt` is preferred and the tokens are passed directly to the engine.

Related issues

Additional information
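For illustration, a minimal usage sketch of feeding pre-tokenized rows through `vLLMEngineProcessor` (not part of this PR; the model name, token IDs, sampling params, and config values are placeholders):

```python
import ray
from ray.data.llm import build_llm_processor, vLLMEngineProcessorConfig

# Placeholder config; any vLLM-compatible model and sizing would do.
config = vLLMEngineProcessorConfig(
    model_source="facebook/opt-125m",
    concurrency=1,
    batch_size=16,
)

processor = build_llm_processor(
    config,
    # Emit only tokenized_prompt; with this change, no "prompt" column is required.
    preprocess=lambda row: dict(
        tokenized_prompt=row["token_ids"],
        sampling_params=dict(max_tokens=32),
    ),
    postprocess=lambda row: dict(generated_text=row["generated_text"]),
)

ds = ray.data.from_items([{"token_ids": [1, 2, 3, 4]}])
ds = processor(ds)
ds.show()
```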