feat: PyTorchJobExecutor for Kubeflow Training Operator #461
svcnvidia-nemo-ci wants to merge 3 commits into main from
Conversation
…etes

Introduces PyTorchJobExecutor and a matching TorchX scheduler so users can deploy distributed PyTorchJobs to any Kubernetes cluster running the Kubeflow Training Operator via run.run() / run.Experiment.

- PyTorchJobExecutor builds and submits PyTorchJob CRDs via the K8s API (local kubeconfig with in-cluster fallback)
- cancel(wait=True) polls until both the CR and all associated pods are fully terminated
- TorchX scheduler persists job state and maps PyTorchJobState -> AppState
- Full TDD: tests written before implementation
- Documentation added to docs/guides/execution.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
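For context on what "builds and submits PyTorchJob CRDs" involves, here is a minimal sketch of such a manifest assembled as a plain dict. Field names follow the Kubeflow Training Operator v1 CRD schema; the function name, job name, image, and replica counts are illustrative assumptions, not the PR's actual code:

```python
# Hypothetical sketch of a PyTorchJob custom resource built as a dict, the
# shape an executor might POST to the Kubernetes API. Values are examples.
def build_pytorchjob_manifest(name: str, image: str, nworkers: int, nproc_per_node: int) -> dict:
    """Build a PyTorchJob CR with one Master replica and nworkers Workers."""
    def replica(n: int) -> dict:
        return {
            "replicas": n,
            "template": {"spec": {"containers": [{"name": "pytorch", "image": image}]}},
        }
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            # PyTorchJob carries per-pod parallelism here (as a string in the
            # v1 CRD), not via torchrun launcher flags.
            "nprocPerNode": str(nproc_per_node),
            "pytorchReplicaSpecs": {
                "Master": replica(1),
                "Worker": replica(nworkers),
            },
        },
    }

manifest = build_pytorchjob_manifest("demo-job", "pytorch/pytorch:2.3", nworkers=2, nproc_per_node=8)
```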
```python
            return None
        executor.cancel(job_name)

    def list(self) -> list[ListAppResponse]: ...
```
Check notice (Code scanning / CodeQL): Statement has no effect
Copilot Autofix (AI, 10 days ago):
In general, to fix a “statement has no effect” caused by a bare ... in a concrete method, replace the ellipsis with a meaningful implementation or, if the method is intentionally unsupported, with an explicit raise NotImplementedError (or similar) so the intent is clear and the code has an observable effect.
Here, PyTorchJobScheduler.list is declared but unimplemented. Without changing existing functionality (i.e., without guessing at how to list jobs), the safest, least-invasive fix is to replace the ... body with a raise NotImplementedError explaining that listing is not yet supported for PyTorchJobScheduler. This turns the no-op ellipsis into a deliberate runtime error if the method is called, which is standard practice for unimplemented interface methods.
Specifically, in nemo_run/run/torchx_backend/schedulers/pytorchjob.py, at the def list method around line 197, replace the entire line def list(self) -> list[ListAppResponse]: ... with a multi-line method definition:
```python
def list(self) -> list[ListAppResponse]:
    raise NotImplementedError("Listing apps is not implemented for PyTorchJobScheduler.")
```

No new imports or helper methods are required.
```diff
@@ -194,7 +194,8 @@
             return None
         executor.cancel(job_name)

-    def list(self) -> list[ListAppResponse]: ...
+    def list(self) -> list[ListAppResponse]:
+        raise NotImplementedError("Listing apps is not implemented for PyTorchJobScheduler.")

     def _validate(self, app: AppDef, scheduler: str) -> None:
         pass
```
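Besides list(), the commit message above notes that the scheduler maps PyTorchJobState to TorchX's AppState. A hedged sketch of what such a mapping could look like, using plain strings so it stays self-contained; the exact condition names and mapping in the PR may differ:

```python
# Hypothetical mapping from Kubeflow PyTorchJob condition types to TorchX
# AppState names. Shown with plain strings rather than the real TorchX enum
# so the sketch needs no imports; the PR's actual table may differ.
PYTORCHJOB_TO_APPSTATE = {
    "Created": "PENDING",
    "Restarting": "RUNNING",
    "Running": "RUNNING",
    "Succeeded": "SUCCEEDED",
    "Failed": "FAILED",
}

def to_app_state(condition: str) -> str:
    # Unrecognized conditions fall back to UNKNOWN rather than raising.
    return PYTORCHJOB_TO_APPSTATE.get(condition, "UNKNOWN")
```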
Aligns the constructor parameter name with the plan spec and example.py. The field now shadows the base-class method, which is intentional: PyTorchJob specifies parallelism in spec.nprocPerNode, not via the TorchX Torchrun launcher machinery.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
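The shadowing described above can be sketched with plain Python: a dataclass field with the same name as a base-class method replaces the bound method on instances. Class and attribute names here are illustrative assumptions, not copied from the PR:

```python
# Illustrative-only sketch of a dataclass field intentionally shadowing a
# base-class method of the same name. Names are assumed for the example.
from dataclasses import dataclass

class BaseExecutor:
    def nproc_per_node(self) -> int:  # method in the base API
        return 1

@dataclass
class PyTorchJobExecutorSketch(BaseExecutor):
    # The instance attribute set by the dataclass __init__ shadows
    # BaseExecutor.nproc_per_node: callers now read an int, and the value
    # would be forwarded to spec.nprocPerNode instead of torchrun flags.
    nproc_per_node: int = 8

ex = PyTorchJobExecutorSketch(nproc_per_node=4)
```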
- launch() gains wait=True/timeout/poll_interval: blocks until RUNNING, SUCCEEDED, or FAILED, so callers no longer need to poll manually
- fetch_logs: stream=False uses subprocess.run (tolerates pods still initializing); stream=True uses Popen + a generator, matching DGXCloud streaming behaviour
- local/example.py: full e2e cycle: launch(wait=True), poll logs until the sentinel 'NEMO_TEST_OK' appears, assert, cancel(wait=True)
- 46/46 tests pass; verified against a real cluster

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
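The wait/timeout/poll_interval behaviour described above can be sketched as a generic polling loop. The get_state callable is injected so the loop is testable without a cluster; the real implementation would query PyTorchJob status from the Kubernetes API, and the function name here is an assumption:

```python
# Hedged sketch of launch(wait=True)-style blocking: poll an injected
# get_state() until the job is RUNNING, SUCCEEDED, or FAILED, or raise
# TimeoutError once the deadline passes.
import time

TERMINAL_OR_RUNNING = {"RUNNING", "SUCCEEDED", "FAILED"}

def wait_for_state(get_state, timeout: float = 300.0, poll_interval: float = 2.0) -> str:
    """Poll get_state() until a running/terminal state or timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = get_state()
        if state in TERMINAL_OR_RUNNING:
            return state
        time.sleep(poll_interval)
    raise TimeoutError(f"job did not reach a running/terminal state within {timeout}s")

# Simulated status sequence: two pending polls, then RUNNING.
states = iter(["PENDING", "CREATED", "RUNNING"])
result = wait_for_state(lambda: next(states), timeout=5.0, poll_interval=0.0)
```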
Closing to recreate under correct author.
Summary

- PyTorchJobExecutor that builds and submits PyTorchJob CRDs to a Kubernetes cluster running the Kubeflow Training Operator
- Integrates with run.run() and run.Experiment
- cancel(wait=True) polls until both the CR and all associated pods are fully terminated

Test plan

- uv run -- pytest test/core/execution/test_pytorchjob.py test/run/torchx_backend/schedulers/test_pytorchjob.py -v passes (44/44)
- uv run --group lint -- ruff check --fix . && uv run --group lint -- ruff format . is clean
- Verified the launch() → status() → cancel(wait=True) cycle against a real cluster

🤖 Generated with Claude Code