Skip to content

Support CNI spec for NCCL over k8s#848

Merged
amaslenn merged 1 commit intomainfrom
am/cloudai-3/nccl-cni-spec
Mar 25, 2026
Merged

Support CNI spec for NCCL over k8s#848
amaslenn merged 1 commit intomainfrom
am/cloudai-3/nccl-cni-spec

Conversation

@amaslenn
Copy link
Contributor

Summary

Support CNI spec for NCCL over k8s.

Test Plan

  1. CI (extended)
  2. Manual runs on dev k8s cluster.

Additional Notes

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 25, 2026

📝 Walkthrough

Walkthrough

Kubernetes CNI networking support was added to the system. A new use_host_network configuration field and resolution methods were introduced to KubernetesSystem. NCCL test JSON generation was refactored to resolve and apply discovered CNI networks to pod specifications. Comprehensive test coverage was added for the new functionality.

Changes

Cohort / File(s) Summary
Kubernetes System Enhancement
src/cloudai/systems/kubernetes/kubernetes_system.py
Added use_host_network boolean field, get_network_attachment_definitions() method to query kubectl for available networks, and resolve_cni_networks() method to determine which CNI networks to use based on configuration.
NCCL Test JSON Generation
src/cloudai/workloads/nccl_test/kubernetes_json_gen_strategy.py
Refactored MPIJob generation to resolve CNI networks and pass them to pod spec builders. Updated launcher/worker specs to conditionally set hostNetwork based on network availability, added pod annotation for CNI networks, and extended resource preparation to request per-network NIC resources.
Test Fixtures & Coverage
tests/conftest.py, tests/systems/kubernetes/test_system.py, tests/workloads/nccl_test/test_json_gen_strategy_kubernetes.py
Updated test fixture to pass use_host_network=True. Added comprehensive test suites for network attachment definition discovery, CNI network resolution logic, and NCCL JSON generation under both CNI and host-network modes with appropriate pod spec validation.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~35 minutes

Poem

🐰 Hops of joy! The networks now bloom,
CNI discovery banishes the gloom,
Pod specs dance with network flair,
Host or CNI—we choose with care! 🌐✨

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Support CNI spec for NCCL over k8s' directly and clearly summarizes the main objective of the changeset: adding CNI specification support for NCCL networking in Kubernetes.
Description check ✅ Passed The description is directly related to the changeset, explaining the addition of CNI specification support for NCCL over Kubernetes with a clear test plan, though it is brief.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch am/cloudai-3/nccl-cni-spec

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cloudai/systems/kubernetes/kubernetes_system.py`:
- Around line 136-151: The get_network_attachment_definitions function assumes
every item has metadata.name and metadata.namespace which can raise KeyError on
malformed items; update the list construction to defensively access keys (e.g.,
use item.get("metadata", {}) and metadata.get("name")/get("namespace")) or wrap
per-item extraction in a try/except that skips and logs malformed entries,
returning only well-formed "namespace/name" strings to preserve the existing
graceful-degradation behavior.

In `@src/cloudai/workloads/nccl_test/kubernetes_json_gen_strategy.py`:
- Around line 217-222: The loop handling cni_networks can raise IndexError
because net.split("/",1)[1] assumes a "/" exists; update the loop in
kubernetes_json_gen_strategy.py where cni_networks is processed (the block
referencing cni_networks, nic_name, and resources["requests"]/["limits"]) to
defensively parse net: either use str.partition("/") and take the second part if
present, or check for "/" and skip or use the whole string as the nic_name
fallback; ensure you do not crash on malformed entries and still populate
resources keys only when a valid nic_name is obtained (reference
resolve_cni_networks() and upstream get_network_attachment_definitions() for
expected format).

In `@tests/systems/kubernetes/test_system.py`:
- Line 20: Replace typing.Dict and typing.List with the built-in generic types
to modernize annotations: update the import line that currently imports "Dict,
List" from typing to remove them (keep ClassVar if still used) and change all
annotations that use Dict[...] and List[...] to dict[...] and list[...],
including the function signature that currently uses Dict/List so it matches the
file's existing use of list[str] and other built-in generics.

In `@tests/workloads/nccl_test/test_json_gen_strategy_kubernetes.py`:
- Around line 180-185: The test test_launcher_never_gets_nic_resources uses a
startswith check that assumes NIC resource keys begin with "nic"—make the
assertion explicit by deriving expected resource keys from CNI_NETS (e.g., map
each entry in CNI_NETS like "namespace/nic-name" to the resource key
"nvidia.com/{nic-name}") and assert none of those exact resource keys appear in
launcher_resources["requests"]; locate this logic around gen_json, CNI_NETS,
payload and launcher_resources in the test and replace the startswith-based
any(...) check with an exact-match check against the constructed keys.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 49efdc8e-0540-423b-bc2b-345785e56f2b

📥 Commits

Reviewing files that changed from the base of the PR and between a32d394 and 92956da.

📒 Files selected for processing (5)
  • src/cloudai/systems/kubernetes/kubernetes_system.py
  • src/cloudai/workloads/nccl_test/kubernetes_json_gen_strategy.py
  • tests/conftest.py
  • tests/systems/kubernetes/test_system.py
  • tests/workloads/nccl_test/test_json_gen_strategy_kubernetes.py

@amaslenn amaslenn requested a review from podkidyshev March 25, 2026 15:44
@amaslenn amaslenn merged commit 96f185e into main Mar 25, 2026
5 checks passed
@amaslenn amaslenn deleted the am/cloudai-3/nccl-cni-spec branch March 25, 2026 16:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants