-
Notifications
You must be signed in to change notification settings - Fork 706
feat: Libfabric support #4407
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: Libfabric support #4407
Conversation
|
👋 Hi chandlj! Thank you for contributing to ai-dynamo/dynamo. Just a reminder: The 🚀 |
WalkthroughThe PR enhances KV transfer configuration validation in vLLM arguments (adding normalization and "none"/"null" handling) and adds libfabric library support across build configuration and Dockerfiles. NIXL is updated from 0.7.0 to 0.7.1, and vllm[flashinfer] is pinned to 0.11.0. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~20 minutes
Poem
Pre-merge checks❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Tip 📝 Customizable high-level summaries are now available in beta!You can now customize how CodeRabbit generates the high-level summary in your pull requests — including its content, structure, tone, and formatting.
Example instruction:
Note: This feature is currently in beta for Pro-tier users, and pricing will be announced later. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
components/src/dynamo/vllm/args.py (1)
293-300: Consider adding a comment explaining the flag interaction design.The conflict detection logic is correct and enforces that users must explicitly specify
--connector nonewhen using--kv-transfer-config. This prevents accidental mixing of CLI connectors with custom transfer configs.Consider adding a comment above line 293 to clarify the design intent:
+ # When not using none/null, enforce that --connector and --kv-transfer-config + # are mutually exclusive. Users must explicitly use --connector none with + # --kv-transfer-config to avoid conflicts. else:This will help future maintainers understand why the conflict check is scoped to the else block.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
components/src/dynamo/vllm/args.py(2 hunks)container/Dockerfile(8 hunks)container/Dockerfile.sglang(3 hunks)container/Dockerfile.trtllm(3 hunks)container/Dockerfile.vllm(3 hunks)container/build.sh(2 hunks)pyproject.toml(1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-08-30T20:43:49.632Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 2797
File: container/Dockerfile:437-449
Timestamp: 2025-08-30T20:43:49.632Z
Learning: In the dynamo project's devcontainer setup, the team prioritizes consistency across framework-specific Dockerfiles (like container/Dockerfile, container/Dockerfile.vllm, etc.) by mirroring their structure, even when individual optimizations might be possible, to maintain uniformity in the development environment setup.
Applied to files:
container/Dockerfile.vllm
📚 Learning: 2025-09-03T01:10:12.599Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 2822
File: container/Dockerfile.vllm:343-352
Timestamp: 2025-09-03T01:10:12.599Z
Learning: In the dynamo project's local-dev Docker targets, USER_UID and USER_GID build args are intentionally left without default values to force explicit UID/GID mapping during build time, preventing file permission issues in local development environments where container users need to match host user permissions for mounted volumes.
Applied to files:
container/Dockerfile
📚 Learning: 2025-07-30T00:34:35.810Z
Learnt from: dmitry-tokarev-nv
Repo: ai-dynamo/dynamo PR: 2179
File: docs/support_matrix.md:61-63
Timestamp: 2025-07-30T00:34:35.810Z
Learning: In docs/support_matrix.md, the NIXL version difference between runtime dependencies (0.5.0) and build dependencies (0.4.0) is intentional and expected, not an error that needs to be corrected.
Applied to files:
container/build.sh
🪛 Ruff (0.14.5)
components/src/dynamo/vllm/args.py
277-279: Avoid specifying long messages outside the exception class
(TRY003)
289-291: Avoid specifying long messages outside the exception class
(TRY003)
296-298: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (13)
components/src/dynamo/vllm/args.py (3)
272-279: LGTM! Clear validation with helpful error messages.The connector normalization and validation logic is well-implemented. The error message clearly indicates both the invalid connectors and the valid options.
Note: Static analysis suggests extracting the error messages into exception subclasses (TRY003), but this is a minor style preference and can be deferred.
281-292: LGTM! Proper handling of "none"/"null" connectors.The logic correctly:
- Detects user-provided kv_transfer_config with defensive checks
- Prevents mixing "none"/"null" with other connectors
- Clears connector_list when "none"/"null" is specified alone
This design allows users to explicitly disable CLI connectors when using custom
--kv-transfer-config.
319-324: No fixes required — attribute is guaranteed to exist.The
kv_connectorattribute is always present onKVTransferConfigobjects in vLLM 0.11.0 (defined askv_connector: str | None = None). Direct access is safe; even if the value isNone, the comparisonNone == "NixlConnector"evaluates correctly toFalsewithout raising an error. The code requires no changes.Likely an incorrect or invalid review comment.
container/Dockerfile.vllm (1)
203-204: Consistent libfabric integration in vLLM runtime image.The additions properly follow the pattern for UCX/NIXL integration: installing libhwloc15, copying libfabric from dynamo_base, and extending LD_LIBRARY_PATH. Changes align with consistent framework-specific Dockerfile structure as noted in learnings.
Also applies to: 248-261
container/build.sh (1)
118-121: NIXL_LIBFABRIC_REF properly defined and propagated.New libfabric reference variable is consistently defined and added to build arguments following the same pattern as NIXL_REF. Version v2.3.0 appears to be a stable upstream release.
Also applies to: 538-539
container/Dockerfile.trtllm (1)
282-286: Minor: Comment clarity—verify libfabric source stage.Line 282's comment states "Copy Libfabric from framework image" but the COPY command sources from
dynamo_base. While functionally correct, the comment should precisely match the source stage name for clarity. This mirrors a similar pattern in Dockerfile.vllm, but verify if both comments should be harmonized.container/Dockerfile.sglang (1)
127-128: Consistent libfabric integration in SGLang runtime image.Libfabric support properly integrated with appropriate package installation, library copying, and LD_LIBRARY_PATH updates. Note that libhwloc15 is installed in the framework stage rather than runtime stage—this aligns with SGLang's use of system Python rather than isolated venvs, which is acceptable.
Also applies to: 331-331, 343-343
container/Dockerfile (5)
279-306: Libfabric build configuration: verify --disable-verbs is intentional.The libfabric build disables the verbs (InfiniBand) provider (line 292:
--disable-verbs) while UCX is built WITH verbs support (line 267). This separation may be intentional—UCX provides verbs transport while libfabric focuses on EFA—but should be verified as a design choice. Confirm this configuration aligns with NIXL 0.7.1 requirements and cloud deployment expectations.
41-42: ARG propagation properly structured for libfabric support.NIXL_LIBFABRIC_REF is correctly declared at the top level and redeclared in the base and wheel_builder stages, following the established pattern for NIXL configuration arguments. This ensures proper scoping across multi-stage builds.
Also applies to: 64-64, 150-150
428-430: Libfabric copy in dev stage properly integrated.Libfabric is correctly copied alongside UCX and NIXL from wheel_builder stage with appropriate ownership and path handling. Comment clearly describes the addition.
319-320: No issues found. NIXL 0.7.1 includes libfabric support and the-Dlibfabric_pathmeson option is valid and documented. The Dockerfile correctly builds libfabric separately and passes the path to the NIXL meson build at line 320. The integration across all stages (build, wheel_builder, dev) is consistent and correct.
178-181: No issues with libfabric build dependencies—additions are correct and necessary.libfabric v2.3.0 requires hwloc-devel and numactl-devel as build dependencies, and the Dockerfile additions at lines 178-181 are appropriate. libfabric auto-detects hwloc and libnuma during configure, so the lack of explicit
--with-hwlocor--with-numactlflags in the configure block (lines 292-298) is correct. These packages are standard build requirements for libfabric and are not redundant—manylinux base images do not typically include development packages.pyproject.toml (1)
57-58: vllm 0.11.0 dependency bump is compatible with codebase integration.Verification confirms that vllm 0.11.0's KV transfer enhancements (engine_id requirement, kv_connector_extra_config support, and NixlConnector as first-class connector) are already fully implemented in the codebase:
- engine_id is properly extracted and passed through KvConnectorLeader/KvConnectorWorker initialization (connector_leader.py, connector_worker.py)
- NixlConnector is imported from the vllm 0.11.0 location and used in PdConnector for multi-connector setup (pd_connector.py)
- kv_connector_extra_config is already used for multi-connector configuration (consolidator_config.py, args.py)
- VLLM_NIXL environment variables are already configured in tests
No additional codebase updates are needed beyond the dependency bump.
09bafc8 to
d60b1e3
Compare
Signed-off-by: Joe Chandler <jchandler@glia-ai.com>
This reverts commit 1971265. Signed-off-by: Joe Chandler <jchandler@glia-ai.com>
Signed-off-by: Joe Chandler <jchandler@glia-ai.com>
Signed-off-by: Joe Chandler <jchandler@glia-ai.com>
Signed-off-by: Joe Chandler <jchandler@glia-ai.com>
Signed-off-by: Joe Chandler <jchandler@glia-ai.com>
Signed-off-by: Joe Chandler <jchandler@glia-ai.com>
d60b1e3 to
eac85aa
Compare
| export CC=${CC} && \ | ||
| export CXX=${CXX} && \ | ||
| cd /usr/local/src && \ | ||
| git clone https://github.com/ofiwg/libfabric.git && \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason we have to build libfabric from source instead of just installing the matching version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was just mirroring how UCX was installed and NIXL upstream has changed the minimum NIXL version a few times so I wanted to be able to specify a version, but I'm sure either would work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, I think libfabric provides pre-built packages, https://github.com/ofiwg/libfabric?tab=readme-ov-file#installing-pre-built-libfabric-packages, would prefer if we could just pin to the dpkg package instead of building from source. The reason we do it for UCX & NIXL today is since the headers are required for dynamo integration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nv-tusharma so I tried installing via dnf but the available libfabric version through that is out of date (v1.18) compared to the minimum required by NIXL (v1.21)
In NIXL's examples, they were building from source so that's what I went with, I can keep pursuing if this is a blocker.
Yet another alternative would be something along the lines of what llm-d is doing, which is using the efa installer. Not sure if this is actually required, but in any case it doesn't work in manylinux, so it would require significant changes to the Dockerfile.
| vllm = [ | ||
| "uvloop", | ||
| "nixl[cu12]<=0.7.1", | ||
| "vllm[flashinfer]==0.10.2", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm surprised this wasn't updated already. Thanks for the catch!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
woops, probably my fault.
Overview:
This PR updates the Dynamo Dockerfiles to build NIXL with libfabric support and install the necessary runtime packages. Builds off #4317 to allow users to set the
NixlConnectorbackend toLIBFABRICusing--kv-transfer-config.Details:
Builds off #4317 by updating each respective framework Dockerfile as well as the base Dockerfile to build NIXL v0.7.1 with Libfabric support. Also bumps the vLLM version to 0.11.0 to have support for setting
NixlConnectorbackends.Where should the reviewer start?
Dockerfile,Dockerfile.[framework],pyproject.tomlRelated Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Relates to #4185
Summary by CodeRabbit