feat: add KubeflowExecutor for Kubeflow Training Operator (TrainJob CRD)#462

Open

ko3n1g wants to merge 16 commits into `main` from `feat/pytorchjob-executor`

Conversation


@ko3n1g ko3n1g commented Mar 12, 2026

Summary

  • Adds `KubeflowExecutor` that submits distributed training jobs to any Kubernetes cluster running the Kubeflow Training Operator
  • Supports both PyTorchJob (Training Operator v1) and TrainJob (Training Operator v2) via a `job_kind` toggle
  • Pairs with a TorchX scheduler so jobs integrate with `run.run()` and `run.Experiment`
  • Kubernetes config loaded automatically (local kubeconfig → in-cluster fallback)
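The config-loading fallback in the last bullet can be sketched with stand-in callables (the real executor presumably wraps the `kubernetes` client's loader functions; the names here are hypothetical):

```python
def load_config_with_fallback(load_local, load_incluster):
    # Try the local kubeconfig first; on failure fall back to the
    # in-cluster service-account config. Both callables are hypothetical
    # stand-ins for the kubernetes client's loaders.
    try:
        return load_local()
    except Exception:
        return load_incluster()
```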

PyTorchJob vs TrainJob

|  | PyTorchJob | TrainJob |
| --- | --- | --- |
| API | `kubeflow.org/v1` | `trainer.kubeflow.org/v1alpha1` |
| Pod config | directly in replica pod spec | `podTemplateOverrides[].spec` |
| `nproc` | `spec.nprocPerNode` | `spec.trainer.numProcPerNode` |
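For orientation, the two shapes can be sketched as minimal manifest dicts (values are placeholders; the field placement follows the table above, while the replica-spec key names are assumptions about the CRDs):

```python
# Minimal skeleton of the v1 CRD: nprocPerNode and pod config sit directly
# under the replica spec.
pytorchjob = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "spec": {
        "nprocPerNode": "8",
        "pytorchReplicaSpecs": {
            "Worker": {"replicas": 2, "template": {"spec": {"containers": []}}}
        },
    },
}

# Minimal skeleton of the v2 CRD: nproc moves under spec.trainer and pod
# config goes through podTemplateOverrides[].spec.
trainjob = {
    "apiVersion": "trainer.kubeflow.org/v1alpha1",
    "kind": "TrainJob",
    "spec": {
        "trainer": {"numProcPerNode": "8"},
        "podTemplateOverrides": [{"spec": {"containers": []}}],
    },
}
```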

Notable fields

  • `tolerations`, `affinity` — go into pod spec / `podTemplateOverrides` automatically
  • `env_list` — full env var dicts supporting `valueFrom` / `secretKeyRef`
  • `pod_spec_overrides` — arbitrary extra pod spec fields (e.g. `resourceClaims` for IMEX channels)
  • `launch(wait=True)` — polls until `RUNNING` / `SUCCEEDED` / `FAILED`
  • `cancel(wait=True)` — polls until CR gone and all pods terminated
  • `UNKNOWN`/`None` status → `AppState.PENDING` (avoids false failures on transient API errors)
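The shapes these fields take can be illustrated as follows (all names such as `wandb-secret` and `imex` are made up for illustration, and `map_status` is only a sketch of the mapping described in the last bullet):

```python
# env_list entries are full Kubernetes env-var dicts, so valueFrom /
# secretKeyRef work as-is.
env_list = [
    {"name": "NCCL_DEBUG", "value": "INFO"},
    {"name": "WANDB_API_KEY",
     "valueFrom": {"secretKeyRef": {"name": "wandb-secret", "key": "api-key"}}},
]

# pod_spec_overrides passes arbitrary extra pod-spec fields through,
# e.g. resourceClaims for IMEX channels.
pod_spec_overrides = {
    "resourceClaims": [
        {"name": "imex-channel", "resourceClaimTemplateName": "imex"}
    ],
}

def map_status(cr_status):
    # Known states pass through; UNKNOWN or a missing status (e.g. a
    # transient API error) maps to PENDING instead of a false failure.
    if cr_status in {"RUNNING", "SUCCEEDED", "FAILED"}:
        return cr_status
    return "PENDING"
```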

Minimal E2E example

```python
import nemo_run as run
from nemo_run.core.execution.kubeflow import KubeflowExecutor

executor = KubeflowExecutor(
    namespace="my-namespace",
    image="pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime",
    num_nodes=2,
    gpus_per_node=8,
    launcher=run.Torchrun(),  # torchrun args injected automatically
    volumes=[{"name": "data", "persistentVolumeClaim": {"claimName": "my-pvc"}}],
    volume_mounts=[{"name": "data", "mountPath": "/data"}],
)

script = run.Script("train.py")

run.run(script, executor=executor, name="my-training-job")
```

Test plan

  • 63 unit tests passing (`pytest test/core/execution/test_kubeflow.py test/run/torchx_backend/schedulers/test_kubeflow.py`)
  • PyTorchJob e2e verified against AWS EKS (`local/example.py`): launch → RUNNING → log sentinel → cancel(wait=True)
  • TrainJob e2e pending GKE cluster readiness (`local/example_trainjob.py`)

🤖 Generated with Claude Code

…ent, test cleanup)

- Add explanatory comment to empty AttributeError except in _get_job_dirs
  (backwards-compat field migration — absence is expected and handled)
- Add noqa + comment to Jinja2 Environment for shell-script template
  (autoescape intentionally disabled for .sh/.j2; no XSS risk)
- Remove unused _raise_on_read helper in test_fetch_logs_stream_handles_exception
- Use sys.modules lookup instead of duplicate import in test_import_error_when_kubernetes_unavailable

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>

@chtruong814 chtruong814 left a comment


Awesome. I think we should drop the v1 support. It's deprecated. Had some other feedback. Please take a look.

```python
runtime_ref: str = "torch-distributed"
namespace: str = "default"
image: str = ""
num_nodes: int = 2
```
Contributor


Why would num_nodes default to 2?

```python
try:
    cfg = serializer.deserialize(app["executor"])
    # Backwards compat: migrate renamed field nproc_per_node → nprocs_per_node.
    # AttributeError means the field doesn't exist so no migration is needed.
```
Contributor


Is this because of the v1 vs v2 spec difference?

Contributor Author


not sure.. might be something we can reduce.. i'll check

Contributor Author


okay that was a local caching issue, this backward compat isn't required anymore

ko3n1g and others added 4 commits March 16, 2026 16:18
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
PyTorchJob (Training Operator v1) is deprecated in favour of TrainJob
(Training Operator v2).  Simplify the executor to support TrainJob only:

- Remove PyTorchJob constants, `job_kind` field, `_get_pytorchjob_body`,
  and all PyTorchJob branches in `_group`, `_version`, `_plural`,
  `_pod_label_selector`, `get_job_body`, and `status`.
- Inline the trivial `_group()`, `_version()`, `_plural()`, and
  `_pod_label_selector()` helpers; callers now reference the
  `_TRAINJOB_*` constants and the label-selector format string directly.
- Rename `_get_trainjob_body` → `get_job_body` (drop the one-line wrapper).
- Remove backwards-compat `nproc_per_node → nprocs_per_node` migration
  block in `schedulers/kubeflow.py` (only relevant for legacy PyTorchJob
  persisted state).
- Add docstrings to all public methods that lacked them.
- Update tests: remove PyTorchJob-specific tests, drop `job_kind="TrainJob"`
  params (now the only kind), fix status/launch-wait fixtures to use the
  TrainJob `jobsStatus` format.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
…or state

Persisted entries in ~/.nemo_run/.kubeflow_jobs.json written before
PyTorchJob was removed still carry job_kind in their serialized Fiddle
config.  Strip it before fdl.build() to avoid a TypeError on status
polling and log fetching for those old runs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
…utor state

Entries written before the nproc_per_node → nprocs_per_node rename still
exist in ~/.nemo_run/.kubeflow_jobs.json.  Migrate the value and drop the
old key alongside the existing job_kind removal so both old field names are
handled in one place.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
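Taken together, the two compat fixes amount to a small strip-and-rename pass over the persisted config before `fdl.build()`; a sketch, assuming the serialized config behaves like a flat dict (the real Fiddle config object may differ):

```python
def migrate_persisted_executor_cfg(cfg: dict) -> dict:
    # Drop the job_kind field removed along with PyTorchJob support, and
    # rename nproc_per_node -> nprocs_per_node, so entries written to
    # ~/.nemo_run/.kubeflow_jobs.json by older versions still deserialize.
    cfg = dict(cfg)
    cfg.pop("job_kind", None)
    if "nproc_per_node" in cfg:
        cfg["nprocs_per_node"] = cfg.pop("nproc_per_node")
    return cfg
```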
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
kubectl enforces a default max of 5 concurrent log requests when using
a label selector. Pass --max-log-requests=num_nodes so fetch_logs works
correctly for larger jobs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
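A sketch of the resulting invocation (the label selector and exact flag layout are assumptions about the internal kubectl call):

```python
def build_logs_cmd(namespace: str, job_name: str, num_nodes: int) -> list:
    # kubectl caps concurrent log streams at 5 when a label selector is
    # used; raising --max-log-requests to the node count lets every pod
    # of a larger job stream its logs.
    return [
        "kubectl", "logs", "-f",
        "-n", namespace,
        "-l", f"jobset.sigs.k8s.io/jobset-name={job_name}",  # assumed selector
        f"--max-log-requests={num_nodes}",
    ]
```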
ko3n1g added 2 commits March 17, 2026 21:31
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
chtruong814 previously approved these changes Mar 17, 2026
@ko3n1g ko3n1g changed the title feat: add KubeflowExecutor for Kubeflow Training Operator (PyTorchJob + TrainJob) feat: add KubeflowExecutor for Kubeflow Training Operator (TrainJob CRD) Mar 17, 2026
Comment on lines +146 to +148

```python
elif fn_or_script.inline and role_args:
    # Inline scripts are written to a file; role_args[0] is the pod-side path
    script = role_args[0]
```
Contributor


will this work for slurm as well?

Replace the brittle `lines_yielded > 0` and 10-minute deadline heuristics
with `status()`-based termination: the retry loop now runs until the job
reaches SUCCEEDED or FAILED, handling slow container pulls, mid-stream
crashes, and transient network failures correctly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
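The replacement loop can be sketched as follows (`fetch_status` and `stream_chunk` are hypothetical callables standing in for the scheduler's `status()` and log-streaming calls):

```python
import time

TERMINAL = {"SUCCEEDED", "FAILED"}

def stream_logs_until_done(fetch_status, stream_chunk, poll_s=10.0):
    # Yield log lines, retrying after transient failures, until status()
    # reports a terminal state — no line-count or deadline heuristics.
    while True:
        try:
            for line in stream_chunk():
                yield line
        except ConnectionError:
            pass  # transient network failure: re-check status, then retry
        if fetch_status() in TERMINAL:
            return
        time.sleep(poll_s)
```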
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>