feat: PyTorchJobExecutor for Kubeflow Training Operator by svcnvidia-nemo-ci · Pull Request #461 · NVIDIA-NeMo/Run

svcnvidia-nemo-ci · 2026-03-12T15:06:41Z

Summary

- Adds PyTorchJobExecutor, which builds and submits PyTorchJob CRDs to a Kubernetes cluster running the Kubeflow Training Operator
- Pairs with a TorchX scheduler so jobs integrate with run.run() and run.Experiment
- Kubernetes config is loaded automatically (local kubeconfig → in-cluster fallback)
- cancel(wait=True) polls until both the CR and all associated pods are fully terminated
- Full TDD: tests written before implementation; covers happy path, error cases, state mapping, and persistence
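
The cancel(wait=True) behaviour described above can be pictured as a simple polling loop. The following is a minimal sketch, not the code from this PR: the helper name wait_until_terminated and the two callables (cr_exists, pods_remaining) are hypothetical stand-ins for whatever Kubernetes API calls the executor actually makes, and the default timeout and interval are assumptions.

```python
import time
from typing import Callable


def wait_until_terminated(
    cr_exists: Callable[[], bool],
    pods_remaining: Callable[[], int],
    timeout: float = 300.0,
    poll_interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once the CR is gone and no pods remain, False on timeout.

    Both callables are hypothetical hooks; a real executor would query the
    Kubernetes API for the PyTorchJob object and its pods.
    """
    deadline = time.monotonic() + timeout
    while True:
        # Only count pods once the custom resource itself has disappeared.
        if not cr_exists() and pods_remaining() == 0:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_interval)


# Usage with in-memory stand-ins for the cluster state.
_cr_checks = iter([True, True, False, False])  # CR disappears on the third poll
_pod_counts = iter([1, 0])                     # last pod exits one poll later
terminated = wait_until_terminated(
    cr_exists=lambda: next(_cr_checks),
    pods_remaining=lambda: next(_pod_counts),
    timeout=5.0,
    poll_interval=0.0,
    sleep=lambda _: None,
)
```

Checking the CR and the pods separately matters because pods can outlive the deletion of the custom resource while they finish terminating.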

Test plan

- uv run -- pytest test/core/execution/test_pytorchjob.py test/run/torchx_backend/schedulers/test_pytorchjob.py -v passes (44/44)
- uv run --group lint -- ruff check --fix . && uv run --group lint -- ruff format . runs clean
- End-to-end: launch() → status() → cancel(wait=True) cycle against a real cluster

🤖 Generated with Claude Code

…etes

Introduces PyTorchJobExecutor and a matching TorchX scheduler so users can deploy distributed PyTorchJobs to any Kubernetes cluster running the Kubeflow Training Operator via run.run() / run.Experiment.

- PyTorchJobExecutor builds and submits PyTorchJob CRDs via the K8s API (local kubeconfig with in-cluster fallback)
- cancel(wait=True) polls until both the CR and all associated pods are fully terminated
- TorchX scheduler persists job state and maps PyTorchJobState -> AppState
- Full TDD: tests written before implementation
- Documentation added to docs/guides/execution.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
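
The PyTorchJobState -> AppState mapping mentioned in the commit message might look roughly like the sketch below. Both enums here are local stand-ins so the example runs without dependencies: the real AppState lives in torchx.specs, and the members of PyTorchJobState in this PR are not shown on this page, so the names below (modelled on the Training Operator's job condition types) are assumptions.

```python
from enum import Enum


class PyTorchJobState(Enum):
    """Hypothetical stand-in; the PR's actual enum members may differ."""
    CREATED = "Created"
    RUNNING = "Running"
    RESTARTING = "Restarting"
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"
    UNKNOWN = "Unknown"


class AppState(Enum):
    """Local stand-in for a subset of torchx.specs.AppState."""
    SUBMITTED = "SUBMITTED"
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    UNKNOWN = "UNKNOWN"


# One plausible mapping; treating RESTARTING as PENDING is an assumption.
_STATE_MAP = {
    PyTorchJobState.CREATED: AppState.SUBMITTED,
    PyTorchJobState.RESTARTING: AppState.PENDING,
    PyTorchJobState.RUNNING: AppState.RUNNING,
    PyTorchJobState.SUCCEEDED: AppState.SUCCEEDED,
    PyTorchJobState.FAILED: AppState.FAILED,
}


def to_app_state(state: PyTorchJobState) -> AppState:
    """Translate a job state, falling back to UNKNOWN for unmapped values."""
    return _STATE_MAP.get(state, AppState.UNKNOWN)


mapped = to_app_state(PyTorchJobState.RUNNING)
fallback = to_app_state(PyTorchJobState.UNKNOWN)
```

Using a dictionary with an explicit UNKNOWN fallback keeps the scheduler from raising if the operator reports a state the mapping does not cover.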

nemo_run/run/torchx_backend/schedulers/pytorchjob.py

+            return None
+        executor.cancel(job_name)
+
+    def list(self) -> list[ListAppResponse]: ...


In general, to fix a “statement has no effect” caused by a bare ... in a concrete method, replace the ellipsis with a meaningful implementation or, if the method is intentionally unsupported, with an explicit raise NotImplementedError (or similar) so the intent is clear and the code has an observable effect.

Here, PyTorchJobScheduler.list is declared but unimplemented. Without changing existing functionality (i.e., without guessing at how to list jobs), the safest, least-invasive fix is to replace the ... body with a raise NotImplementedError explaining that listing is not yet supported for PyTorchJobScheduler. This turns the no-op ellipsis into a deliberate runtime error if the method is called, which is standard practice for unimplemented interface methods.

Specifically, in nemo_run/run/torchx_backend/schedulers/pytorchjob.py, at the def list method around line 197, replace the entire line def list(self) -> list[ListAppResponse]: ... with a multi-line method definition:

    def list(self) -> list[ListAppResponse]:
        raise NotImplementedError("Listing apps is not implemented for PyTorchJobScheduler.")

No new imports or helper methods are required.

Aligns the constructor parameter name with the plan spec and example.py. The field now shadows the base-class method, which is intentional: PyTorchJob specifies parallelism in spec.nprocPerNode, not via the TorchX Torchrun launcher machinery.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
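
To illustrate where spec.nprocPerNode sits in a PyTorchJob, here is a minimal sketch of a manifest built as a plain dictionary. The function build_pytorchjob_manifest and all of its parameters are hypothetical and are not taken from this PR; the restart policy, the single-container template, and passing nprocPerNode as a string are assumptions about a typical Kubeflow Training Operator v1 manifest.

```python
from typing import Any, Dict, Sequence


def build_pytorchjob_manifest(
    name: str,
    namespace: str,
    image: str,
    command: Sequence[str],
    num_workers: int,
    nproc_per_node: int,
) -> Dict[str, Any]:
    """Build a PyTorchJob manifest dict (hypothetical helper, not the PR's code)."""

    def replica_spec(replicas: int) -> Dict[str, Any]:
        return {
            "replicas": replicas,
            "restartPolicy": "OnFailure",  # assumed default for this sketch
            "template": {
                "spec": {
                    "containers": [
                        # The operator conventionally expects a container named "pytorch".
                        {"name": "pytorch", "image": image, "command": list(command)}
                    ]
                }
            },
        }

    spec: Dict[str, Any] = {
        # Processes per node are declared on the job spec itself,
        # rather than through a torchrun launcher wrapper.
        "nprocPerNode": str(nproc_per_node),
        "pytorchReplicaSpecs": {"Master": replica_spec(1)},
    }
    if num_workers > 0:
        spec["pytorchReplicaSpecs"]["Worker"] = replica_spec(num_workers)

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": spec,
    }


manifest = build_pytorchjob_manifest(
    name="demo-job",
    namespace="default",
    image="example.invalid/trainer:latest",  # placeholder image
    command=["python", "train.py"],
    num_workers=2,
    nproc_per_node=8,
)
```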

- launch() gains wait=True/timeout/poll_interval: blocks until RUNNING, SUCCEEDED, or FAILED, so callers no longer need to poll manually
- fetch_logs: stream=False uses subprocess.run (tolerates pods still initializing); stream=True uses Popen + generator, matching DGXCloud streaming behaviour
- local/example.py: full e2e cycle: launch(wait=True), poll logs until sentinel 'NEMO_TEST_OK' appears, assert, cancel(wait=True)
- 46/46 tests pass; verified against real cluster

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
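
The two fetch_logs modes described in this commit can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: the signature here takes an arbitrary command so the example stays runnable without a cluster, whereas the real method presumably builds a log command for the job's pods. Reading "tolerates pods still initializing" as "a non-zero exit code does not raise" is also an assumption.

```python
import subprocess
import sys
from typing import Iterator, List, Sequence, Union


def _stream_lines(cmd: Sequence[str]) -> Iterator[str]:
    """Yield output lines as the child process produces them."""
    proc = subprocess.Popen(
        list(cmd), stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    try:
        assert proc.stdout is not None
        for line in proc.stdout:
            yield line.rstrip("\n")
    finally:
        # Clean up even if the consumer stops iterating early.
        if proc.poll() is None:
            proc.terminate()
        if proc.stdout is not None:
            proc.stdout.close()
        proc.wait()


def fetch_logs(cmd: Sequence[str], stream: bool = False) -> Union[List[str], Iterator[str]]:
    """Hypothetical signature: return a snapshot list, or a line generator when streaming."""
    if stream:
        return _stream_lines(cmd)
    # check=False: a failing command yields whatever output exists instead of raising.
    result = subprocess.run(list(cmd), capture_output=True, text=True, check=False)
    return result.stdout.splitlines()


# Usage with a stand-in command that prints the sentinel used in local/example.py.
demo_cmd = [sys.executable, "-c", "print('NEMO_TEST_OK')"]
snapshot = fetch_logs(demo_cmd)
streamed = list(fetch_logs(demo_cmd, stream=True))
```

The generator form lets a caller stop reading as soon as a sentinel line appears, while the snapshot form suits repeated polling.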

ko3n1g · 2026-03-12T15:29:27Z

Closing to recreate under correct author.

svcnvidia-nemo-ci had a problem deploying to public March 12, 2026 15:07 — with GitHub Actions Failure

svcnvidia-nemo-ci temporarily deployed to public March 12, 2026 15:07 — with GitHub Actions Inactive

github-advanced-security bot found potential problems Mar 12, 2026

ko3n1g had a problem deploying to public March 12, 2026 15:16 — with GitHub Actions Failure

ko3n1g temporarily deployed to public March 12, 2026 15:16 — with GitHub Actions Inactive

ko3n1g closed this Mar 12, 2026

ko3n1g had a problem deploying to public March 12, 2026 15:30 — with GitHub Actions Failure

ko3n1g temporarily deployed to public March 12, 2026 15:30 — with GitHub Actions Inactive

svcnvidia-nemo-ci wants to merge 3 commits into main from feat/pytorchjob-executor
