fix: Support for msg[content] as a list #4485
Conversation
Signed-off-by: Krishnan Prashanth <kprashanth@nvidia.com>
Signed-off-by: KrishnanPrash <140860868+KrishnanPrash@users.noreply.github.com>
Walkthrough

Changes add support for a new multimodal LLaVA model by implementing content-array format detection, introducing format-aware message transformation logic, and extending model configuration. The detection mechanism identifies at initialization time whether templates require array-based content, then uses this information to steer message-formatting decisions.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 0
🧹 Nitpick comments (4)
lib/llm/src/preprocessor/prompt/template/formatters.rs (2)
9-39: Content-format detection logic looks sound but may under-detect in exotic templates

The `detect_content_array_usage` probe is straightforward and safe (errors → empty string), and it correctly distinguishes templates that break on string content. However, it only declares `requires_content_arrays = true` when the array probe surfaces `"template_test"` and the string probe does not. Templates that never surface user content (or only use it for control flow) will silently fall back to `false` even if they require arrays, which may cause subtle mismatches.

If you expect such edge-case templates, consider:

- Allowing an explicit override in `ChatTemplate` / MDC, or
- Making the probe more flexible (e.g., checking for successful render vs. substring match, or allowing a configurable sentinel key).
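For illustration, the probe idea can be sketched with a mock renderer standing in for the real minijinja template. The sentinel value `"template_test"` matches the review's description; the renderer and helper names below are hypothetical, not the actual implementation:

```rust
/// Sentinel that the probe looks for in the rendered output.
const SENTINEL: &str = "template_test";

/// The two content shapes a chat template may receive.
enum Content<'a> {
    Text(&'a str),
    Array(Vec<&'a str>),
}

/// Mock of a template that only understands array-form content
/// (a bare string renders to nothing), like a LLaVA-style template.
fn mock_render_array_only(content: &Content) -> String {
    match content {
        Content::Text(_) => String::new(), // string content is silently dropped
        Content::Array(parts) => parts.join("\n"),
    }
}

/// Probe: the template requires arrays iff the sentinel survives the
/// array probe but not the string probe (render errors would map to "").
fn requires_content_arrays(render: impl Fn(&Content) -> String) -> bool {
    let string_probe = render(&Content::Text(SENTINEL));
    let array_probe = render(&Content::Array(vec![SENTINEL]));
    array_probe.contains(SENTINEL) && !string_probe.contains(SENTINEL)
}

fn main() {
    // An array-only template is flagged.
    assert!(requires_content_arrays(mock_render_array_only));

    // A template that surfaces both shapes is not flagged.
    assert!(!requires_content_arrays(|c: &Content| match c {
        Content::Text(t) => t.to_string(),
        Content::Array(p) => p.join("\n"),
    }));
    println!("probe sketch ok");
}
```

This also shows the under-detection concern: a mock template that never surfaces user content at all would fail both probes and come back `false` regardless of what it actually requires.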
152-161: Detection currently ignores tool-only templates

`detect_content_array_usage(&env)` always probes `"default"`, but in the map case it's possible to have only `"tool_use"` registered. In that situation, `get_template("default")` fails and `requires_content_arrays` is forced to `false`, even if the actual tool template expects arrays, so `may_be_fix_msg_content` will convert text-only arrays back to strings.

If you intend to support tool-only templates, consider:

- Probing `"tool_use"` when `"default"` is absent, or
- Probing all registered templates and OR-ing their results.
lib/llm/src/preprocessor/prompt/template.rs (1)
104-110: Struct extension is consistent with new behavior

Adding `requires_content_arrays: bool` here aligns with the new detection and render logic and doesn't introduce construction hazards given the single `new()` path.

You might consider adding a brief doc comment on the field (e.g., "true if the underlying chat template only supports array-form content") to avoid future confusion when other formatters are added.
lib/llm/src/preprocessor/prompt/template/oai.rs (1)
172-175: Message flow centralization is good; minor opportunity to avoid double serialization

Switching `NvCreateChatCompletionRequest::messages()` to just serialize `self.inner.messages` and then normalizing in `OAIPromptFormatter::render()` centralizes content handling and keeps the trait implementation simple.

In `render`, you currently go:

- `req.messages()` → `Value`
- `serde_json::to_value(...)` → `serde_json::Value`
- `may_be_fix_msg_content(...)` → `Value`
- `serde_json::to_value(...)` → `serde_json::Value`

You could slightly simplify and reduce conversions by having `may_be_fix_msg_content` operate directly on `serde_json::Value` and return that, or by calling it before converting into a `Value` in the first place. Not urgent, but it would shave some overhead on a hot path.

Also applies to: 287-310
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- examples/backends/vllm/launch/agg_multimodal.sh (1 hunks)
- lib/llm/src/preprocessor/prompt/template.rs (1 hunks)
- lib/llm/src/preprocessor/prompt/template/formatters.rs (2 hunks)
- lib/llm/src/preprocessor/prompt/template/oai.rs (15 hunks)
- tests/serve/test_vllm.py (1 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: KrishnanPrash
Repo: ai-dynamo/dynamo PR: 3067
File: lib/llm/src/preprocessor/prompt/template/oai.rs:87-134
Timestamp: 2025-09-16T19:47:30.312Z
Learning: In Dynamo, multimodal requests (containing image_url or other non-text content) are processed through a completely different workflow than text-only requests, so the may_be_fix_msg_content function in lib/llm/src/preprocessor/prompt/template/oai.rs will only encounter text-only content arrays.
📚 Learning: 2025-09-16T19:47:30.312Z
Learnt from: KrishnanPrash
Repo: ai-dynamo/dynamo PR: 3067
File: lib/llm/src/preprocessor/prompt/template/oai.rs:87-134
Timestamp: 2025-09-16T19:47:30.312Z
Learning: In Dynamo, multimodal requests (containing image_url or other non-text content) are processed through a completely different workflow than text-only requests, so the may_be_fix_msg_content function in lib/llm/src/preprocessor/prompt/template/oai.rs will only encounter text-only content arrays.
Applied to files:
- lib/llm/src/preprocessor/prompt/template/formatters.rs
- lib/llm/src/preprocessor/prompt/template/oai.rs
- lib/llm/src/preprocessor/prompt/template.rs
📚 Learning: 2025-09-22T18:09:23.513Z
Learnt from: KrishnanPrash
Repo: ai-dynamo/dynamo PR: 3165
File: components/backends/sglang/src/dynamo/sglang/args.py:201-202
Timestamp: 2025-09-22T18:09:23.513Z
Learning: KrishnanPrash suggested adding early validation for custom Jinja template paths in the Rust layer (lib/bindings/python/rust/lib.rs) to benefit both vLLM and SGLang workflows, using PathBuf::from() and path.exists() checks with appropriate PyFileNotFoundError handling.
Applied to files:
lib/llm/src/preprocessor/prompt/template/formatters.rs
📚 Learning: 2025-09-10T22:32:12.978Z
Learnt from: zhongdaor-nv
Repo: ai-dynamo/dynamo PR: 2999
File: lib/parsers/src/tool_calling/harmony/harmony_parser.rs:250-256
Timestamp: 2025-09-10T22:32:12.978Z
Learning: In lib/parsers/src/tool_calling/harmony/harmony_parser.rs, the team prefers to maintain identical code patterns between parse_tool_calls_harmony and parse_tool_calls_harmony_complete functions, including message.content[0] indexing, to ensure consistency between streaming and complete parser implementations.
Applied to files:
lib/llm/src/preprocessor/prompt/template/oai.rs
🧬 Code graph analysis (3)
lib/llm/src/preprocessor/prompt/template/formatters.rs (2)
lib/llm/src/preprocessor/prompt/template/oai.rs (3)
- messages (172-175)
- messages (223-234)
- supports_add_generation_prompt (283-285)

lib/llm/src/preprocessor/prompt.rs (3)
- messages (51-51)
- supports_add_generation_prompt (82-82)
- supports_add_generation_prompt (95-97)
lib/llm/src/preprocessor/prompt/template/oai.rs (1)
lib/llm/src/preprocessor/prompt.rs (1)
- messages (51-51)
tests/serve/test_vllm.py (1)
tests/utils/payload_builder.py (2)
- chat_payload (129-156)
- chat_payload_default (18-43)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (17)
- GitHub Check: sglang (amd64)
- GitHub Check: sglang (arm64)
- GitHub Check: trtllm (amd64)
- GitHub Check: trtllm (arm64)
- GitHub Check: vllm (arm64)
- GitHub Check: vllm (amd64)
- GitHub Check: operator (amd64)
- GitHub Check: operator (arm64)
- GitHub Check: Mirror Repository to GitLab
- GitHub Check: Build and Test - dynamo
- GitHub Check: tests (lib/runtime/examples)
- GitHub Check: tests (launch/dynamo-run)
- GitHub Check: tests (.)
- GitHub Check: clippy (lib/bindings/python)
- GitHub Check: tests (lib/bindings/python)
- GitHub Check: clippy (launch/dynamo-run)
- GitHub Check: clippy (.)
🔇 Additional comments (4)
examples/backends/vllm/launch/agg_multimodal.sh (1)
45-51: New LLaVA branch is consistent with existing Qwen config

The added `elif` for `llava-hf/llava-1.5-7b-hf` mirrors the Qwen settings and cleanly plugs into the existing EXTRA_ARGS logic. Looks good; you can later tune `--max-model-len` or GPU utilization independently if needed.

tests/serve/test_vllm.py (1)
220-251: LLaVA multimodal config mirrors Qwen coverage and exercises new path

The `multimodal_agg_llava` entry cleanly follows the existing Qwen aggregated setup and, importantly, adds:

- An image_url payload to validate multimodal behavior, and
- A `chat_payload_default` to exercise the string → array normalization.

This looks like solid end-to-end coverage for the new logic.
lib/llm/src/preprocessor/prompt/template/oai.rs (2)
76-135: Bidirectional msg.content normalization is correct and preserves multimodal safety

The updated `may_be_fix_msg_content` cleanly separates the two modes:

- `preserve_arrays = true`: strings are wrapped into a single `{"type":"text","text":...}` element, leaving existing arrays untouched.
- `preserve_arrays = false`: non-empty, text-only arrays are concatenated with `\n`, while mixed / non-text / empty arrays are preserved.

This matches the desired behavior: standard templates get simple strings, while multimodal templates can rely on array form without losing mixed content. Given the prior guarantee that production traffic here is text-only arrays, the extra guards around mixed and non-text types add nice future safety without regressions. Based on learnings.
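The two modes can be sketched as follows, using simplified enum types in place of the real serde_json-backed messages (the function name mirrors the PR, but the types here are illustrative, not the actual implementation):

```rust
#[derive(Debug, Clone, PartialEq)]
enum Part {
    Text(String),
    ImageUrl(String), // stands in for any non-text content part
}

#[derive(Debug, Clone, PartialEq)]
enum MsgContent {
    Text(String),
    Parts(Vec<Part>),
}

/// Simplified model of the bidirectional normalization.
fn normalize(content: MsgContent, preserve_arrays: bool) -> MsgContent {
    if preserve_arrays {
        // Array-based templates: wrap bare strings; leave arrays untouched.
        match content {
            MsgContent::Text(s) => MsgContent::Parts(vec![Part::Text(s)]),
            parts => parts,
        }
    } else {
        // Standard templates: join non-empty text-only arrays with '\n';
        // keep mixed, non-text, or empty arrays as-is.
        match content {
            MsgContent::Parts(parts)
                if !parts.is_empty()
                    && parts.iter().all(|p| matches!(p, Part::Text(_))) =>
            {
                let joined = parts
                    .iter()
                    .map(|p| match p {
                        Part::Text(t) => t.as_str(),
                        _ => unreachable!(),
                    })
                    .collect::<Vec<_>>()
                    .join("\n");
                MsgContent::Text(joined)
            }
            other => other,
        }
    }
}

fn main() {
    // string -> array for array-based templates
    assert_eq!(
        normalize(MsgContent::Text("Hello".into()), true),
        MsgContent::Parts(vec![Part::Text("Hello".into())])
    );

    // text-only array -> joined string for standard templates
    assert_eq!(
        normalize(
            MsgContent::Parts(vec![Part::Text("t1".into()), Part::Text("t2".into())]),
            false,
        ),
        MsgContent::Text("t1\nt2".into())
    );

    // mixed array is preserved even when preserve_arrays = false
    let mixed = MsgContent::Parts(vec![
        Part::Text("look".into()),
        Part::ImageUrl("http://example.com/cat.png".into()),
    ]);
    assert_eq!(normalize(mixed.clone(), false), mixed);
    println!("normalization sketch ok");
}
```

The asymmetry is deliberate: wrapping a string is always lossless, while flattening an array is only safe when every part is text, which is why the guards on the `preserve_arrays = false` path matter.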
450-478: Test coverage around content normalization and multimodal/tool interaction is comprehensive

The new and updated tests exercise:

- Array → string behavior for single and multiple messages, including system/assistant roles and empty arrays.
- Preservation of mixed and non-text-only arrays (image/video/audio) in both standard and multimodal-like scenarios.
- Interaction with tool-call argument normalization in the presence of multimodal content.
- String → array conversion and array preservation when `preserve_arrays=true`.

This suite gives strong confidence that the new normalization rules hold across both typical text-only and richer multimodal-shaped payloads, and that tool handling remains correct.

Also applies to: 482-541, 545-565, 569-591, 595-653, 720-770, 875-921, 923-978
This PR adds necessary support for multimodal models (like …)
Signed-off-by: Krishnan Prashanth <kprashanth@nvidia.com>
LGTM, but needs Rust checks fixed.
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com> Signed-off-by: KrishnanPrash <140860868+KrishnanPrash@users.noreply.github.com>
Signed-off-by: Krishnan Prashanth <kprashanth@nvidia.com>
Signed-off-by: Krishnan Prashanth <kprashanth@nvidia.com> Signed-off-by: KrishnanPrash <140860868+KrishnanPrash@users.noreply.github.com> Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
Overview:
Copy of #4380. Reopened new PR for ease of review + merge.
Chat templates have conflicting expectations for message content format: some expect string content ("Hello"), while others expect array-based content ([{"type": "text", "text": "Hello"}]). When the wrong format is provided, content goes missing or renders as malformed in prompts.
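For reference, the same user message in the two shapes (the surrounding role field is illustrative; the content values come from the examples above):

```json
{"role": "user", "content": "Hello"}

{"role": "user", "content": [{"type": "text", "text": "Hello"}]}
```

A template written for one shape typically drops or mangles content delivered in the other.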
The fix normalizes in both directions: for string-based templates, text-only content arrays are joined into a single string ("t1\nt2"); for array-based templates, string content is wrapped into a content array ([{"type": "text", "text": "..."}]).

Details:

- Added detect_content_array_usage() in formatters.rs
- Added a requires_content_arrays field to HfTokenizerConfigJsonFormatter
- Made may_be_fix_msg_content() bidirectional with a preserve_arrays parameter

Related PRs:
Summary by CodeRabbit
New Features
Tests