
MBridge: using gpus-per-node from system#847

Open
podkidyshev wants to merge 4 commits into main from ipod/mbrdige-gpus

Conversation

@podkidyshev
Contributor

Summary

  • MegatronBridge.gpus_per_node is no longer a required arg
  • uses the system config's gpus_per_node if available
  • overall the arg isn't required by MBridge, which maintains its own default

Test Plan

  • Automated CI

Additional Notes

  • Closes CLOUDAI-32

@coderabbitai
Contributor

coderabbitai bot commented Mar 25, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 2607326e-543c-4b00-8da1-476c436184ae

📥 Commits

Reviewing files that changed from the base of the PR and between d8c55fa and 9821af0.

📒 Files selected for processing (1)
  • tests/test_acceptance.py
💤 Files with no reviewable changes (1)
  • tests/test_acceptance.py

📝 Walkthrough

Walkthrough

Removed the gpus_per_node CLI field from MegatronBridge; SLURM launcher generation now reads GPUs-per-node from the configured system. Tests and example TOML configs were updated to remove CLI overrides; reference sbatch and tests adjusted to expect system-driven GPU-per-node behavior.
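The system-driven behaviour described above can be sketched as follows; `SlurmSystem` and `gpu_directives` here are illustrative stand-ins, not the actual CloudAI implementation:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SlurmSystem:
    """Hypothetical stand-in for the configured system."""
    gpus_per_node: Optional[int] = None


def gpu_directives(system: SlurmSystem) -> List[str]:
    """Emit GPU directives only when the system declares a GPU count."""
    if system.gpus_per_node is None:
        return []
    n = system.gpus_per_node
    return [f"--gpus-per-node={n}", f"--gres=gpu:{n}"]


print(gpu_directives(SlurmSystem(gpus_per_node=8)))
# → ['--gpus-per-node=8', '--gres=gpu:8']
```

When the system leaves gpus_per_node unset, nothing is emitted, mirroring the conditional emission logic the PR keeps.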

Changes

  • CLI model (src/cloudai/workloads/megatron_bridge/megatron_bridge.py): Deleted gpus_per_node from MegatronBridgeCmdArgs (the field is removed from the CLI argument model).
  • SLURM command generation (src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py): Switched the GPUs-per-node source from args.gpus_per_node to self.system.gpus_per_node for -gn and --additional_slurm_params (the conditional emission logic is unchanged).
  • Tests (tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py): Replaced the CLI-override test with parametrized cases driven by configured_slurm_system.gpus_per_node; asserts presence/absence and contents of GPU directives accordingly.
  • Example configs (conf/experimental/megatron_bridge/test/.../megatron_bridge_qwen_30b.toml): Removed gpus_per_node entries from multiple TOML test configs so the system configuration supplies GPUs-per-node.
  • Reference job (tests/ref_data/megatron-bridge.sbatch): Updated the expected SLURM invocation to use 8 GPUs per node (-gn 8 and gpus-per-node=8;gres=gpu:8).
  • Acceptance test (tests/test_acceptance.py): Removed the gpus_per_node override from the megatron-bridge test definition so it relies on the system configuration.
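The parametrized test behaviour summarized above can be approximated by a table-driven check; `render_wrapper` is a hypothetical stand-in for the generated wrapper content, not the real command-generation strategy:

```python
from typing import Optional


def render_wrapper(system_gpus_per_node: Optional[int]) -> str:
    """Hypothetical stand-in: render sbatch directives from the system value."""
    if system_gpus_per_node is None:
        return "#SBATCH -N 1\n"
    n = system_gpus_per_node
    return (
        "#SBATCH -N 1\n"
        f"#SBATCH --gpus-per-node={n}\n"
        f"#SBATCH --gres=gpu:{n}\n"
    )


# Cases mirror the PR's parametrization: directives appear iff the system sets a count.
cases = [(8, True), (4, True), (None, False)]
for gpus, expect_directive in cases:
    content = render_wrapper(gpus)
    assert ("gpus-per-node=" in content) == expect_directive
    if expect_directive:
        assert f"gres=gpu:{gpus}" in content
```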

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

I’m a rabbit, I hopped through code tonight,
The gpus_per_node slipped out of sight. 🐇
The system now hums the count to the run,
Tests and sbatch dance—job neatly done,
I nibble a carrot and vanish with delight.

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
  • Title check: ✅ Passed. The title clearly describes the main change: shifting the source of gpus_per_node configuration from MegatronBridge command arguments to the system configuration.
  • Description check: ✅ Passed. The description is directly related to the changeset, explaining that gpus_per_node is no longer required as an argument and now uses the system's configuration instead.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py`:
- Around line 380-410: The test test_gpus_per_node reuses the same output_subdir
("out_gpus") for all parametrized cases which can cause cross-test interference;
update the test to supply a unique output_subdir per case by adding a distinct
value to the parameter set or computing one from the inputs (e.g., include
cmd_args_gpus_per_node and system_gpus_per_node in the subdir name) when calling
make_test_run so each invocation creates/uses its own directory; adjust the
parametrize tuple or the call site where make_test_run(...) is invoked to use
that unique subdirectory.
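One way to implement the unique-subdirectory suggestion is to derive the directory name from the case parameters; `make_output_subdir` is a hypothetical helper, not part of the existing test suite:

```python
from typing import Optional


def make_output_subdir(cmd_args_gpus: Optional[int], system_gpus: Optional[int]) -> str:
    """Derive a distinct output directory name per parametrized case."""
    return f"out_gpus_cmd_{cmd_args_gpus}_sys_{system_gpus}"


# Each parametrized case now writes into its own directory, avoiding interference.
print(make_output_subdir(None, 8))  # → out_gpus_cmd_None_sys_8
```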

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: a9f4c7f8-41d4-49b2-a082-37db19e94f2b

📥 Commits

Reviewing files that changed from the base of the PR and between e330d84 and 8279eab.

📒 Files selected for processing (3)
  • src/cloudai/workloads/megatron_bridge/megatron_bridge.py
  • src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py
  • tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py

@podkidyshev podkidyshev self-assigned this Mar 25, 2026
amaslenn previously approved these changes Mar 25, 2026
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py (2)

403-411: 🧹 Nitpick | 🔵 Trivial

Consider cleaning up stale cmd_args_overrides parameter.

The test still passes cmd_args_overrides={"gpus_per_node": 2} (line 407), but with the PR changes, gpus_per_node in command args is no longer used—the system configuration value is used instead. This override has no effect on the test outcome.

To improve test clarity, consider:

  1. Removing the stale override: make_test_run(output_subdir="out_no_gpu_directives")
  2. Explicitly setting configured_slurm_system.gpus_per_node to verify the directive is skipped regardless of the system value
♻️ Suggested clarification
     def test_gpus_per_node_skipped_when_gpu_directives_unsupported(
         self, configured_slurm_system: SlurmSystem, make_test_run: Callable[..., TestRun]
     ) -> None:
         configured_slurm_system.supports_gpu_directives_cache = False
-        tr = make_test_run(cmd_args_overrides={"gpus_per_node": 2}, output_subdir="out_no_gpu_directives")
+        configured_slurm_system.gpus_per_node = 4  # Set explicitly to verify directive is skipped
+        tr = make_test_run(output_subdir="out_no_gpu_directives")
         cmd_gen = MegatronBridgeSlurmCommandGenStrategy(configured_slurm_system, tr)
         wrapper_content = self._wrapper_content(cmd_gen)
-        assert "gpus-per-node=2" not in wrapper_content
-        assert "gres=gpu:2" not in wrapper_content
+        assert "gpus-per-node=" not in wrapper_content
+        assert "gres=gpu:" not in wrapper_content
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py` around
lines 403 - 411, Update the
test_gpus_per_node_skipped_when_gpu_directives_unsupported test to remove the
stale cmd_args_overrides by calling make_test_run without the {"gpus_per_node":
2} override and instead explicitly set configured_slurm_system.gpus_per_node to
a representative value (e.g., 2 or 4) before constructing
MegatronBridgeSlurmCommandGenStrategy; keep
configured_slurm_system.supports_gpu_directives_cache = False and assert that
"gpus-per-node=..." and "gres=gpu:..." do not appear in the wrapper_content to
verify the system-level GPU setting is ignored when GPU directives are
unsupported.

77-77: ⚠️ Potential issue | 🟡 Minor

Remove unused gpus_per_node from test fixture and test overrides.

The gpus_per_node field at line 77 (and line 407 in test_gpus_per_node_skipped_when_gpu_directives_unsupported) is not a valid field in MegatronBridgeCmdArgs. While it's silently accepted due to CmdArgs having extra="allow", it's never used and creates confusion. Remove it from the fixture and test overrides to clarify that gpus_per_node is a system-level config, not a command argument.
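The extra="allow" behaviour mentioned above can be demonstrated with a minimal pydantic v2 model; `CmdArgs` here is a simplified stand-in for the real base class, not the CloudAI code:

```python
from pydantic import BaseModel, ConfigDict


class CmdArgs(BaseModel):
    """Simplified stand-in: accepts unknown fields silently, like the real CmdArgs."""
    model_config = ConfigDict(extra="allow")
    num_gpus: int = 8


# The undeclared gpus_per_node key is accepted without error but never used by the model.
args = CmdArgs(gpus_per_node=2)
print(args.model_extra)  # {'gpus_per_node': 2}
```

This is exactly why the stale test override passes validation yet has no effect on the generated command.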

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py` at line
77, Remove the unused "gpus_per_node" key from the test fixture and any test
overrides that build or pass arguments for MegatronBridgeCmdArgs (e.g., the
fixture used in test_command_gen_strategy_slurm and the override in
test_gpus_per_node_skipped_when_gpu_directives_unsupported); since
MegatronBridgeCmdArgs does not define gpus_per_node (CmdArgs allows extras),
delete those entries and any assertions expecting it so the tests only use valid
MegatronBridgeCmdArgs fields.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/ref_data/megatron-bridge.sbatch`:
- Line 22: Remove the unused gpus_per_node entry from the test configuration in
test_acceptance.py so the test no longer supplies an extra field that
MegatronBridgeCmdArgs ignores; keep num_gpus (or num_gpus=8) as the
authoritative GPU count used to generate the sbatch line, and update any related
test expectations/comments to reflect 8 GPUs rather than the removed
gpus_per_node value; do not change MegatronBridgeCmdArgs (which uses
ConfigDict(extra="allow")), only remove the redundant gpus_per_node key from the
test data.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0d455467-ff08-4fb5-a671-96bdeabbee73

📥 Commits

Reviewing files that changed from the base of the PR and between 545da11 and d8c55fa.

📒 Files selected for processing (4)
  • src/cloudai/workloads/megatron_bridge/megatron_bridge.py
  • src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py
  • tests/ref_data/megatron-bridge.sbatch
  • tests/workloads/megatron_bridge/test_command_gen_strategy_slurm.py
💤 Files with no reviewable changes (1)
  • src/cloudai/workloads/megatron_bridge/megatron_bridge.py

