Skip to content

Conversation

@chandlj
Copy link
Contributor

@chandlj chandlj commented Nov 17, 2025

Overview:

This PR updates the Dynamo Dockerfiles to build NIXL with libfabric support and install the necessary runtime packages. Builds off #4317 to allow users to set the NixlConnector backend to LIBFABRIC using --kv-transfer-config.

Details:

Builds off #4317 by updating each respective framework Dockerfile as well as the base Dockerfile to build NIXL v0.7.1 with Libfabric support. Also bumps the vLLM version to 0.11.0 to have support for setting NixlConnector backends.

Where should the reviewer start?

Dockerfile, Dockerfile.[framework], pyproject.toml

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Relates to #4185

Summary by CodeRabbit

  • Chores
    • Updated NIXL dependency from 0.7.0 to 0.7.1
    • Updated vLLM dependency to 0.11.0
    • Updated sglang NIXL dependency to 0.7.1
    • Added libfabric v2.3.0 support across container images for enhanced networking capabilities

@chandlj chandlj requested review from a team as code owners November 17, 2025 19:53
@copy-pr-bot
Copy link

copy-pr-bot bot commented Nov 17, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions
Copy link

👋 Hi chandlj! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test Github Validation CI runs an essential subset of the testing framework to quickly catch errors.Your PR reviewers may elect to test the changes comprehensively before approving your changes.

🚀

@github-actions github-actions bot added external-contribution Pull request is from an external contributor feat labels Nov 17, 2025
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 17, 2025

Walkthrough

The PR enhances KV transfer configuration validation in vLLM arguments (adding normalization and "none"/"null" handling) and adds libfabric library support across build configuration and Dockerfiles. NIXL is updated from 0.7.0 to 0.7.1, and vllm[flashinfer] is pinned to 0.11.0.

Changes

Cohort / File(s) Summary
KV Transfer Config Validation
components/src/dynamo/vllm/args.py
Enhanced conflict checking between --kv-transfer-config and --connector with connector list normalization to lowercase, support for "none"/"null" to clear config, explicit error handling for invalid combinations, and propagation of validated connectors.
Libfabric Build Configuration
container/Dockerfile, container/build.sh
Added NIXL_LIBFABRIC_REF (v2.3.0) variable across build stages, introduced dedicated libfabric compilation from source to /usr/local/libfabric, updated NIXL reference to 0.7.1, added hwloc dependencies, propagated libfabric paths through build arguments and environment variables.
Runtime Container Libfabric Integration
container/Dockerfile.sglang, container/Dockerfile.trtllm, container/Dockerfile.vllm
Consistently added libhwloc15 dependency, copied /usr/local/libfabric from base image, and extended LD_LIBRARY_PATH to include libfabric library directory in runtime stages across three container variants.
Dependency Version Updates
pyproject.toml
Updated nixl constraint from ≤0.7.0 to ≤0.7.1 in vllm and sglang optional dependencies; bumped vllm[flashinfer] from ==0.10.2 to ==0.11.0.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~20 minutes

  • args.py: Pay attention to the validation logic flow for "none"/"null" handling and ensure error messages clearly indicate invalid connector combinations and the normalized valid set.
  • Dockerfile variants: While repetitive, verify that libfabric integration is consistent across all three runtime images (sglang, trtllm, vllm) and that environment variables are correctly propagated to dependent stages.
  • NIXL version compatibility: Confirm that bumping NIXL from 0.7.0 to 0.7.1 and vllm[flashinfer] to 0.11.0 do not introduce breaking changes in the args.py connector logic or build process.

Poem

🐰 The Connector Dance

Validating channels with care,
Normalizing connectors with flair,
Libfabric built from the ground,
Through Dockerfiles it travels around,
Dependencies bump—we're ready to share! 🚀

Pre-merge checks

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat: Libfabric support' accurately reflects the main change of adding Libfabric support across Dockerfiles and dependency updates.
Description check ✅ Passed The PR description follows the template with Overview, Details, and Where to start sections, though the Related Issues section uses 'Relates to' which is acceptable.

Tip

📝 Customizable high-level summaries are now available in beta!

You can now customize how CodeRabbit generates the high-level summary in your pull requests — including its content, structure, tone, and formatting.

  • Provide your own instructions using the high_level_summary_instructions setting.
  • Format the summary however you like (bullet lists, tables, multi-section layouts, contributor stats, etc.).
  • Use high_level_summary_in_walkthrough to move the summary from the description to the walkthrough section.

Example instruction:

"Divide the high-level summary into five sections:

  1. 📝 Description — Summarize the main change in 50–60 words, explaining why this PR is needed, why this solution was chosen, and what was done.
  2. 📓 References — List relevant issues, discussions, documentation, or related PRs.
  3. 📦 Dependencies & Requirements — Mention any new/updated dependencies, environment variable changes, or configuration updates.
  4. 📊 Contributor Summary — Include a Markdown table showing contributions:
    | Contributor | Lines Added | Lines Removed | Files Changed |
  5. ✔️ Additional Notes — Add any extra reviewer context.
    Keep each section concise (under 200 words) and use bullet or numbered lists for clarity."

Note: This feature is currently in beta for Pro-tier users, and pricing will be announced later.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
components/src/dynamo/vllm/args.py (1)

293-300: Consider adding a comment explaining the flag interaction design.

The conflict detection logic is correct and enforces that users must explicitly specify --connector none when using --kv-transfer-config. This prevents accidental mixing of CLI connectors with custom transfer configs.

Consider adding a comment above line 293 to clarify the design intent:

+    # When not using none/null, enforce that --connector and --kv-transfer-config
+    # are mutually exclusive. Users must explicitly use --connector none with
+    # --kv-transfer-config to avoid conflicts.
     else:

This will help future maintainers understand why the conflict check is scoped to the else block.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6afb4d5 and 09bafc8.

📒 Files selected for processing (7)
  • components/src/dynamo/vllm/args.py (2 hunks)
  • container/Dockerfile (8 hunks)
  • container/Dockerfile.sglang (3 hunks)
  • container/Dockerfile.trtllm (3 hunks)
  • container/Dockerfile.vllm (3 hunks)
  • container/build.sh (2 hunks)
  • pyproject.toml (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-08-30T20:43:49.632Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 2797
File: container/Dockerfile:437-449
Timestamp: 2025-08-30T20:43:49.632Z
Learning: In the dynamo project's devcontainer setup, the team prioritizes consistency across framework-specific Dockerfiles (like container/Dockerfile, container/Dockerfile.vllm, etc.) by mirroring their structure, even when individual optimizations might be possible, to maintain uniformity in the development environment setup.

Applied to files:

  • container/Dockerfile.vllm
📚 Learning: 2025-09-03T01:10:12.599Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 2822
File: container/Dockerfile.vllm:343-352
Timestamp: 2025-09-03T01:10:12.599Z
Learning: In the dynamo project's local-dev Docker targets, USER_UID and USER_GID build args are intentionally left without default values to force explicit UID/GID mapping during build time, preventing file permission issues in local development environments where container users need to match host user permissions for mounted volumes.

Applied to files:

  • container/Dockerfile
📚 Learning: 2025-07-30T00:34:35.810Z
Learnt from: dmitry-tokarev-nv
Repo: ai-dynamo/dynamo PR: 2179
File: docs/support_matrix.md:61-63
Timestamp: 2025-07-30T00:34:35.810Z
Learning: In docs/support_matrix.md, the NIXL version difference between runtime dependencies (0.5.0) and build dependencies (0.4.0) is intentional and expected, not an error that needs to be corrected.

Applied to files:

  • container/build.sh
🪛 Ruff (0.14.5)
components/src/dynamo/vllm/args.py

277-279: Avoid specifying long messages outside the exception class

(TRY003)


289-291: Avoid specifying long messages outside the exception class

(TRY003)


296-298: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (13)
components/src/dynamo/vllm/args.py (3)

272-279: LGTM! Clear validation with helpful error messages.

The connector normalization and validation logic is well-implemented. The error message clearly indicates both the invalid connectors and the valid options.

Note: Static analysis suggests extracting the error messages into exception subclasses (TRY003), but this is a minor style preference and can be deferred.


281-292: LGTM! Proper handling of "none"/"null" connectors.

The logic correctly:

  • Detects user-provided kv_transfer_config with defensive checks
  • Prevents mixing "none"/"null" with other connectors
  • Clears connector_list when "none"/"null" is specified alone

This design allows users to explicitly disable CLI connectors when using custom --kv-transfer-config.


319-324: No fixes required — attribute is guaranteed to exist.

The kv_connector attribute is always present on KVTransferConfig objects in vLLM 0.11.0 (defined as kv_connector: str | None = None). Direct access is safe; even if the value is None, the comparison None == "NixlConnector" evaluates correctly to False without raising an error. The code requires no changes.

Likely an incorrect or invalid review comment.

container/Dockerfile.vllm (1)

203-204: Consistent libfabric integration in vLLM runtime image.

The additions properly follow the pattern for UCX/NIXL integration: installing libhwloc15, copying libfabric from dynamo_base, and extending LD_LIBRARY_PATH. Changes align with consistent framework-specific Dockerfile structure as noted in learnings.

Also applies to: 248-261

container/build.sh (1)

118-121: NIXL_LIBFABRIC_REF properly defined and propagated.

New libfabric reference variable is consistently defined and added to build arguments following the same pattern as NIXL_REF. Version v2.3.0 appears to be a stable upstream release.

Also applies to: 538-539

container/Dockerfile.trtllm (1)

282-286: Minor: Comment clarity—verify libfabric source stage.

Line 282's comment states "Copy Libfabric from framework image" but the COPY command sources from dynamo_base. While functionally correct, the comment should precisely match the source stage name for clarity. This mirrors a similar pattern in Dockerfile.vllm, but verify if both comments should be harmonized.

container/Dockerfile.sglang (1)

127-128: Consistent libfabric integration in SGLang runtime image.

Libfabric support properly integrated with appropriate package installation, library copying, and LD_LIBRARY_PATH updates. Note that libhwloc15 is installed in the framework stage rather than runtime stage—this aligns with SGLang's use of system Python rather than isolated venvs, which is acceptable.

Also applies to: 331-331, 343-343

container/Dockerfile (5)

279-306: Libfabric build configuration: verify --disable-verbs is intentional.

The libfabric build disables the verbs (InfiniBand) provider (line 292: --disable-verbs) while UCX is built WITH verbs support (line 267). This separation may be intentional—UCX provides verbs transport while libfabric focuses on EFA—but should be verified as a design choice. Confirm this configuration aligns with NIXL 0.7.1 requirements and cloud deployment expectations.


41-42: ARG propagation properly structured for libfabric support.

NIXL_LIBFABRIC_REF is correctly declared at the top level and redeclared in the base and wheel_builder stages, following the established pattern for NIXL configuration arguments. This ensures proper scoping across multi-stage builds.

Also applies to: 64-64, 150-150


428-430: Libfabric copy in dev stage properly integrated.

Libfabric is correctly copied alongside UCX and NIXL from wheel_builder stage with appropriate ownership and path handling. Comment clearly describes the addition.


319-320: No issues found. NIXL 0.7.1 includes libfabric support and the -Dlibfabric_path meson option is valid and documented. The Dockerfile correctly builds libfabric separately and passes the path to the NIXL meson build at line 320. The integration across all stages (build, wheel_builder, dev) is consistent and correct.


178-181: No issues with libfabric build dependencies—additions are correct and necessary.

libfabric v2.3.0 requires hwloc-devel and numactl-devel as build dependencies, and the Dockerfile additions at lines 178-181 are appropriate. libfabric auto-detects hwloc and libnuma during configure, so the lack of explicit --with-hwloc or --with-numactl flags in the configure block (lines 292-298) is correct. These packages are standard build requirements for libfabric and are not redundant—manylinux base images do not typically include development packages.

pyproject.toml (1)

57-58: vllm 0.11.0 dependency bump is compatible with codebase integration.

Verification confirms that vllm 0.11.0's KV transfer enhancements (engine_id requirement, kv_connector_extra_config support, and NixlConnector as first-class connector) are already fully implemented in the codebase:

  • engine_id is properly extracted and passed through KvConnectorLeader/KvConnectorWorker initialization (connector_leader.py, connector_worker.py)
  • NixlConnector is imported from the vllm 0.11.0 location and used in PdConnector for multi-connector setup (pd_connector.py)
  • kv_connector_extra_config is already used for multi-connector configuration (consolidator_config.py, args.py)
  • VLLM_NIXL environment variables are already configured in tests

No additional codebase updates are needed beyond the dependency bump.

@rmccorm4
Copy link
Contributor

Hi @chandlj, since #4317 was merged can you help pull in the latest updates and fix the merge conflicts?

Signed-off-by: Joe Chandler <jchandler@glia-ai.com>
This reverts commit 1971265.

Signed-off-by: Joe Chandler <jchandler@glia-ai.com>
Signed-off-by: Joe Chandler <jchandler@glia-ai.com>
Signed-off-by: Joe Chandler <jchandler@glia-ai.com>
Signed-off-by: Joe Chandler <jchandler@glia-ai.com>
Signed-off-by: Joe Chandler <jchandler@glia-ai.com>
Signed-off-by: Joe Chandler <jchandler@glia-ai.com>
@chandlj
Copy link
Contributor Author

chandlj commented Nov 21, 2025

Hi @chandlj, since #4317 was merged can you help pull in the latest updates and fix the merge conflicts?

@rmccorm4 done!

export CC=${CC} && \
export CXX=${CXX} && \
cd /usr/local/src && \
git clone https://github.com/ofiwg/libfabric.git && \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we have to build libfabric from source instead of just installing the matching version?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just mirroring how UCX was installed and NIXL upstream has changed the minimum NIXL version a few times so I wanted to be able to specify a version, but I'm sure either would work

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, I think libfabric provides pre-built packages, https://github.com/ofiwg/libfabric?tab=readme-ov-file#installing-pre-built-libfabric-packages, would prefer if we could just pin to the dpkg package instead of building from source. The reason we do it for UCX & NIXL today is since the headers are required for dynamo integration.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nv-tusharma so I tried installing via dnf but the available libfabric version through that is out of date (v1.18) compared to the minimum required by NIXL (v1.21)

In NIXL's examples, they were building from source so that's what I went with, I can keep pursuing if this is a blocker.

Yet another alternative would be something along the lines of what llm-d is doing, which is using the efa installer. Not sure if this is actually required, but in any case it doesn't work in manylinux, so it would require significant changes to the Dockerfile.

vllm = [
"uvloop",
"nixl[cu12]<=0.7.1",
"vllm[flashinfer]==0.10.2",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised this wasn't updated already. Thanks for the catch!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

woops, probably my fault.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

external-contribution Pull request is from an external contributor feat size/M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants