
MBridge: using gpus-per-node from system#847

Open
podkidyshev wants to merge 4 commits into main from ipod/mbrdige-gpus

Conversation

@podkidyshev
Contributor

Summary

  • MegatronBridge.gpus_per_node is no longer a required arg
  • uses the system config's gpus_per_node if available
  • overall the arg isn't required by MBridge, which maintains its own default

Test Plan

  • Automated CI

Additional Notes

  • Closes CLOUDAI-32

@coderabbitai
Contributor

coderabbitai bot commented Mar 25, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 2607326e-543c-4b00-8da1-476c436184ae

📥 Commits

Reviewing files that changed from the base of the PR and between d8c55fa and 9821af0.

📒 Files selected for processing (1)
  • tests/test_acceptance.py
💤 Files with no reviewable changes (1)
  • tests/test_acceptance.py

📝 Walkthrough

Walkthrough

Removed the gpus_per_node CLI field from MegatronBridge; SLURM launcher generation now reads GPUs-per-node from the configured system. Tests and example TOML configs were updated to remove CLI overrides; reference sbatch and tests adjusted to expect system-driven GPU-per-node behavior.
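The system-driven behaviour described above can be sketched as follows; `SlurmSystem` and `gpu_directives` here are illustrative stand-ins, not the actual CloudAI implementation:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SlurmSystem:
    """Hypothetical stand-in for the configured system."""
    gpus_per_node: Optional[int] = None


def gpu_directives(system: SlurmSystem) -> List[str]:
    """Emit GPU directives only when the system declares a GPU count."""
    if system.gpus_per_node is None:
        return []
    n = system.gpus_per_node
    return [f"--gpus-per-node={n}", f"--gres=gpu:{n}"]


print(gpu_directives(SlurmSystem(gpus_per_node=8)))
# → ['--gpus-per-node=8', '--gres=gpu:8']
```

When the system leaves gpus_per_node unset, nothing is emitted, mirroring the conditional emission logic the PR keeps.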

Changes

  • CLI model (src/cloudai/workloads/megatron_bridge/megatron_bridge.py): Deleted gpus_per_node from MegatronBridgeCmdArgs (the field is removed from the CLI argument model).
  • SLURM command generation (src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py): Switched the GPUs-per-node source from args.gpus_per_node to self.system.gpus_per_node for -gn and --additional_slurm_params (the conditional emission logic is unchanged).
  • Tests (tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py): Replaced the CLI-override test with parametrized cases driven by configured_slurm_system.gpus_per_node; asserts presence/absence and contents of GPU directives accordingly.
  • Example configs (conf/experimental/megatron_bridge/test/.../megatron_bridge_qwen_30b.toml): Removed gpus_per_node entries from multiple TOML test configs so the system configuration supplies GPUs-per-node.
  • Reference job (tests/ref_data/megatron-bridge.sbatch): Updated the expected SLURM invocation to use 8 GPUs per node (-gn 8 and gpus-per-node=8;gres=gpu:8).
  • Acceptance test (tests/test_acceptance.py): Removed the gpus_per_node override from the megatron-bridge test definition so it relies on the system configuration.
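The parametrized test behaviour summarized above can be approximated by a table-driven check; `render_wrapper` is a hypothetical stand-in for the generated wrapper content, not the real command-generation strategy:

```python
from typing import Optional


def render_wrapper(system_gpus_per_node: Optional[int]) -> str:
    """Hypothetical stand-in: render sbatch directives from the system value."""
    if system_gpus_per_node is None:
        return "#SBATCH -N 1\n"
    n = system_gpus_per_node
    return (
        "#SBATCH -N 1\n"
        f"#SBATCH --gpus-per-node={n}\n"
        f"#SBATCH --gres=gpu:{n}\n"
    )


# Cases mirror the PR's parametrization: directives appear iff the system sets a count.
cases = [(8, True), (4, True), (None, False)]
for gpus, expect_directive in cases:
    content = render_wrapper(gpus)
    assert ("gpus-per-node=" in content) == expect_directive
    if expect_directive:
        assert f"gres=gpu:{gpus}" in content
```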

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

I’m a rabbit, I hopped through code tonight,
The gpus_per_node slipped out of sight. 🐇
The system now hums the count to the run,
Tests and sbatch dance—job neatly done,
I nibble a carrot and vanish with delight.

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
  • Title check: ✅ Passed. The title clearly describes the main change: shifting the source of gpus_per_node configuration from MegatronBridge command arguments to the system configuration.
  • Description check: ✅ Passed. The description is directly related to the changeset, explaining that gpus_per_node is no longer required as an argument and now uses the system's configuration instead.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py`:
- Around line 380-410: The test test_gpus_per_node reuses the same output_subdir
("out_gpus") for all parametrized cases which can cause cross-test interference;
update the test to supply a unique output_subdir per case by adding a distinct
value to the parameter set or computing one from the inputs (e.g., include
cmd_args_gpus_per_node and system_gpus_per_node in the subdir name) when calling
make_test_run so each invocation creates/uses its own directory; adjust the
parametrize tuple or the call site where make_test_run(...) is invoked to use
that unique subdirectory.
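One way to implement the unique-subdirectory suggestion is to derive the directory name from the case parameters; `make_output_subdir` is a hypothetical helper, not part of the existing test suite:

```python
from typing import Optional


def make_output_subdir(cmd_args_gpus: Optional[int], system_gpus: Optional[int]) -> str:
    """Derive a distinct output directory name per parametrized case."""
    return f"out_gpus_cmd_{cmd_args_gpus}_sys_{system_gpus}"


# Each parametrized case now writes into its own directory, avoiding interference.
print(make_output_subdir(None, 8))  # → out_gpus_cmd_None_sys_8
```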

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: a9f4c7f8-41d4-49b2-a082-37db19e94f2b

📥 Commits

Reviewing files that changed from the base of the PR and between e330d84 and 8279eab.

📒 Files selected for processing (3)
  • src/cloudai/workloads/megatron_bridge/megatron_bridge.py
  • src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py
  • tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py

@podkidyshev podkidyshev self-assigned this Mar 25, 2026
amaslenn previously approved these changes Mar 25, 2026
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py (2)

403-411: 🧹 Nitpick | 🔵 Trivial

Consider cleaning up stale cmd_args_overrides parameter.

The test still passes cmd_args_overrides={"gpus_per_node": 2} (line 407), but with the PR changes, gpus_per_node in command args is no longer used—the system configuration value is used instead. This override has no effect on the test outcome.

To improve test clarity, consider:

  1. Removing the stale override: make_test_run(output_subdir="out_no_gpu_directives")
  2. Explicitly setting configured_slurm_system.gpus_per_node to verify the directive is skipped regardless of the system value
♻️ Suggested clarification
     def test_gpus_per_node_skipped_when_gpu_directives_unsupported(
         self, configured_slurm_system: SlurmSystem, make_test_run: Callable[..., TestRun]
     ) -> None:
         configured_slurm_system.supports_gpu_directives_cache = False
-        tr = make_test_run(cmd_args_overrides={"gpus_per_node": 2}, output_subdir="out_no_gpu_directives")
+        configured_slurm_system.gpus_per_node = 4  # Set explicitly to verify directive is skipped
+        tr = make_test_run(output_subdir="out_no_gpu_directives")
         cmd_gen = MegatronBridgeSlurmCommandGenStrategy(configured_slurm_system, tr)
         wrapper_content = self._wrapper_content(cmd_gen)
-        assert "gpus-per-node=2" not in wrapper_content
-        assert "gres=gpu:2" not in wrapper_content
+        assert "gpus-per-node=" not in wrapper_content
+        assert "gres=gpu:" not in wrapper_content
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py` around
lines 403 - 411, Update the
test_gpus_per_node_skipped_when_gpu_directives_unsupported test to remove the
stale cmd_args_overrides by calling make_test_run without the {"gpus_per_node":
2} override and instead explicitly set configured_slurm_system.gpus_per_node to
a representative value (e.g., 2 or 4) before constructing
MegatronBridgeSlurmCommandGenStrategy; keep
configured_slurm_system.supports_gpu_directives_cache = False and assert that
"gpus-per-node=..." and "gres=gpu:..." do not appear in the wrapper_content to
verify the system-level GPU setting is ignored when GPU directives are
unsupported.

77-77: ⚠️ Potential issue | 🟡 Minor

Remove unused gpus_per_node from test fixture and test overrides.

The gpus_per_node field at line 77 (and line 407 in test_gpus_per_node_skipped_when_gpu_directives_unsupported) is not a valid field in MegatronBridgeCmdArgs. While it's silently accepted due to CmdArgs having extra="allow", it's never used and creates confusion. Remove it from the fixture and test overrides to clarify that gpus_per_node is a system-level config, not a command argument.
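The extra="allow" behaviour mentioned above can be demonstrated with a minimal pydantic v2 model; `CmdArgs` here is a simplified stand-in for the real base class, not the CloudAI code:

```python
from pydantic import BaseModel, ConfigDict


class CmdArgs(BaseModel):
    """Simplified stand-in: accepts unknown fields silently, like the real CmdArgs."""
    model_config = ConfigDict(extra="allow")
    num_gpus: int = 8


# The undeclared gpus_per_node key is accepted without error but never used by the model.
args = CmdArgs(gpus_per_node=2)
print(args.model_extra)  # {'gpus_per_node': 2}
```

This is exactly why the stale test override passes validation yet has no effect on the generated command.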

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py` at line
77, Remove the unused "gpus_per_node" key from the test fixture and any test
overrides that build or pass arguments for MegatronBridgeCmdArgs (e.g., the
fixture used in test_command_gen_strategy_slurm and the override in
test_gpus_per_node_skipped_when_gpu_directives_unsupported); since
MegatronBridgeCmdArgs does not define gpus_per_node (CmdArgs allows extras),
delete those entries and any assertions expecting it so the tests only use valid
MegatronBridgeCmdArgs fields.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/ref_data/megatron-bridge.sbatch`:
- Line 22: Remove the unused gpus_per_node entry from the test configuration in
test_acceptance.py so the test no longer supplies an extra field that
MegatronBridgeCmdArgs ignores; keep num_gpus (or num_gpus=8) as the
authoritative GPU count used to generate the sbatch line, and update any related
test expectations/comments to reflect 8 GPUs rather than the removed
gpus_per_node value; do not change MegatronBridgeCmdArgs (which uses
ConfigDict(extra="allow")), only remove the redundant gpus_per_node key from the
test data.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0d455467-ff08-4fb5-a671-96bdeabbee73

📥 Commits

Reviewing files that changed from the base of the PR and between 545da11 and d8c55fa.

📒 Files selected for processing (4)
  • src/cloudai/workloads/megatron_bridge/megatron_bridge.py
  • src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py
  • tests/ref_data/megatron-bridge.sbatch
  • tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py
💤 Files with no reviewable changes (1)
  • src/cloudai/workloads/megatron_bridge/megatron_bridge.py

