feat: add KubeflowExecutor for Kubeflow Training Operator (TrainJob CRD)#462
Open
feat: add KubeflowExecutor for Kubeflow Training Operator (TrainJob CRD)#462
Conversation
435b65c to
8f943bc
Compare
8f943bc to
9b2921d
Compare
9b2921d to
6f63116
Compare
8c8d0a2 to
7469b52
Compare
7469b52 to
dd6ed75
Compare
…ent, test cleanup) - Add explanatory comment to empty AttributeError except in _get_job_dirs (backwards-compat field migration — absence is expected and handled) - Add noqa + comment to Jinja2 Environment for shell-script template (autoescape intentionally disabled for .sh/.j2; no XSS risk) - Remove unused _raise_on_read helper in test_fetch_logs_stream_handles_exception - Use sys.modules lookup instead of duplicate import in test_import_error_when_kubernetes_unavailable Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>
chtruong814
reviewed
Mar 13, 2026
Contributor
chtruong814
left a comment
There was a problem hiding this comment.
Awesome. I think we should drop the v1 support. It's deprecated. Had some other feedback. Please take a look.
nemo_run/core/execution/kubeflow.py
Outdated
| runtime_ref: str = "torch-distributed" | ||
| namespace: str = "default" | ||
| image: str = "" | ||
| num_nodes: int = 2 |
Contributor
There was a problem hiding this comment.
Why would num_nodes default to 2?
| try: | ||
| cfg = serializer.deserialize(app["executor"]) | ||
| # Backwards compat: migrate renamed field nproc_per_node → nprocs_per_node. | ||
| # AttributeError means the field doesn't exist so no migration is needed. |
Contributor
There was a problem hiding this comment.
Is this because of the v1 vs v2 spec difference?
Contributor
Author
There was a problem hiding this comment.
not sure.. might be something we can reduce.. i'll check
Contributor
Author
There was a problem hiding this comment.
okay that was a local caching issue, this backward compat isn't required anymore
Signed-off-by: oliver könig <okoenig@nvidia.com>
PyTorchJob (Training Operator v1) is deprecated in favour of TrainJob (Training Operator v2). Simplify the executor to support TrainJob only: - Remove PyTorchJob constants, `job_kind` field, `_get_pytorchjob_body`, and all PyTorchJob branches in `_group`, `_version`, `_plural`, `_pod_label_selector`, `get_job_body`, and `status`. - Inline the trivial `_group()`, `_version()`, `_plural()`, and `_pod_label_selector()` helpers; callers now reference the `_TRAINJOB_*` constants and the label-selector format string directly. - Rename `_get_trainjob_body` → `get_job_body` (drop the one-line wrapper). - Remove backwards-compat `nproc_per_node → nprocs_per_node` migration block in `schedulers/kubeflow.py` (only relevant for legacy PyTorchJob persisted state). - Add docstrings to all public methods that lacked them. - Update tests: remove PyTorchJob-specific tests, drop `job_kind="TrainJob"` params (now the only kind), fix status/launch-wait fixtures to use the TrainJob `jobsStatus` format. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>
…or state Persisted entries in ~/.nemo_run/.kubeflow_jobs.json written before PyTorchJob was removed still carry job_kind in their serialized Fiddle config. Strip it before fdl.build() to avoid a TypeError on status polling and log fetching for those old runs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>
…utor state Entries written before the nproc_per_node → nprocs_per_node rename still exist in ~/.nemo_run/.kubeflow_jobs.json. Migrate the value and drop the old key alongside the existing job_kind removal so both old field names are handled in one place. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>
kubectl enforces a default max of 5 concurrent log requests when using a label selector. Pass --max-log-requests=num_nodes so fetch_logs works correctly for larger jobs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
chtruong814
previously approved these changes
Mar 17, 2026
hemildesai
reviewed
Mar 17, 2026
Comment on lines
+146
to
+148
| elif fn_or_script.inline and role_args: | ||
| # Inline scripts are written to a file; role_args[0] is the pod-side path | ||
| script = role_args[0] |
Contributor
There was a problem hiding this comment.
will this work for slurm as well?
Replace the brittle `lines_yielded > 0` and 10-minute deadline heuristics with `status()`-based termination: the retry loop now runs until the job reaches SUCCEEDED or FAILED, handling slow container pulls, mid-stream crashes, and transient network failures correctly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
PyTorchJob vs TrainJob
Notable fields
Minimal E2E example
```python
import nemo_run as run
from nemo_run.core.execution.kubeflow import KubeflowExecutor
executor = KubeflowExecutor(
namespace="my-namespace",
image="pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime",
num_nodes=2,
gpus_per_node=8,
launcher=run.Torchrun(), # torchrun args injected automatically
volumes=[{"name": "data", "persistentVolumeClaim": {"claimName": "my-pvc"}}],
volume_mounts=[{"name": "data", "mountPath": "/data"}],
)
script = run.Script("train.py")
run.run(script, executor=executor, name="my-training-job")
```
Test plan
🤖 Generated with Claude Code