diff --git a/docs/guides/architecture.md b/docs/guides/architecture.md new file mode 100644 index 00000000..fcd1287c --- /dev/null +++ b/docs/guides/architecture.md @@ -0,0 +1,147 @@ +# Architecture + +> **Audience**: Contributors adding new executors, and users who want to understand why something is failing or how to extend NeMo-Run. +> +> **Prerequisite**: Read [Execution](execution.md), at least one [executor guide](executors/index.md), and [Management](management.md) first. + +## `run.run()` vs `run.Experiment` + +`run.run()` is a thin convenience wrapper. Internally it creates an `Experiment` with a single task and `detach=False`: + +```python +# These two are equivalent +run.run(task, executor=executor) + +with run.Experiment("untitled") as exp: + exp.add(task, executor=executor) + exp.run(detach=False) +``` + +All the mechanics described below apply to both. + +--- + +## Call chain + +```{mermaid} +flowchart TD + A["exp.run()"] --> B["Experiment._prepare()"] + B --> C["Job.prepare()"] + C --> D["executor.assign(exp_id, exp_dir, task_id, task_dir)"] + C --> E["executor.create_job_dir()"] + C --> F["package(task, executor) → AppDef + Role(s)"] + A --> G["Job.launch(runner)"] + G --> H["runner.dryrun(AppDef, scheduler_name, cfg=executor)"] + H --> I["scheduler.submit_dryrun(AppDef, executor)"] + G --> J["runner.schedule(dryrun_info)"] + J --> K["scheduler.schedule(dryrun_info) → AppHandle"] +``` + +1. `_prepare()` calls `Job.prepare()` for each task, which assigns experiment/job directories, syncs code, and builds the TorchX `AppDef`. +2. `Job.launch(runner)` calls `runner.dryrun()` to validate the submission plan, then `runner.schedule()` to submit it. +3. The `AppHandle` returned by `scheduler.schedule()` is stored in the experiment metadata so `Experiment.from_id()` can reconnect. 
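The stored handle is a plain string in the `"{scheduler}://{runner}/{app_id}"` format, so it can be decomposed with ordinary string handling. A minimal sketch (the handle value below is hypothetical, for illustration only):

```python
# Decompose a TorchX AppHandle of the form "{scheduler}://{runner}/{app_id}".
# The handle value here is hypothetical, not one produced by a real run.
handle = "slurm_tunnel://torchx/nemo-job-1234"

scheduler, _, rest = handle.partition("://")
runner, _, app_id = rest.partition("/")

print(scheduler, runner, app_id)  # slurm_tunnel torchx nemo-job-1234
```

This is how `Experiment.from_id()` can route status and log queries back to the scheduler that originally submitted the job.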
+ +--- + +## Executor → TorchX scheduler mapping + +Each executor is backed by a TorchX scheduler registered as an entry point in `pyproject.toml` under `torchx.schedulers`: + +| Executor | TorchX Scheduler | +|----------|-----------------| +| `LocalExecutor` | `local_persistent` | +| `DockerExecutor` | `docker_persistent` | +| `SlurmExecutor` | `slurm_tunnel` | +| `SkypilotExecutor` | `skypilot` | +| `SkypilotJobsExecutor` | `skypilot_jobs` | +| `DGXCloudExecutor` | `dgx_cloud` | +| `LeptonExecutor` | `lepton` | + +Schedulers are discovered at runtime via `torchx.schedulers.get_scheduler_factories()`. + +--- + +## Key TorchX types + +| Type | What it represents | +|------|--------------------| +| `AppDef` | Full application: list of `Role`s + metadata | +| `Role` | One execution unit: entrypoint, args, env, image, num_replicas, resources | +| `AppDryRunInfo` | Validated `AppDef` + submission plan (can be inspected without running) | +| `AppHandle` | Running job ID: `"{scheduler}://{runner}/{app_id}"` | +| `AppState` | Status enum: `RUNNING`, `SUCCEEDED`, `FAILED`, `CANCELLED`, `UNKNOWN` | + +--- + +## How `Executor` fields map to TorchX concepts + +| Executor field | TorchX mapping | +|----------------|---------------| +| `nnodes()` + `nproc_per_node()` | `Role.num_replicas` + replica topology | +| `launcher` | `AppDef` structure (`torchrun` / `ft` / basic entrypoint) | +| `retries` | `Role.max_retries` | +| `env_vars` | `Role.env` | +| `packager` | Pre-launch code sync strategy | +| `assign(exp_id, exp_dir, task_id, task_dir)` | Sets path metadata consumed by the scheduler | + +--- + +## Metadata storage layout + +All experiment metadata is written under `NEMORUN_HOME` (default `~/.nemo_run`): + +``` +~/.nemo_run/experiments/{title}/{title}_{exp_id}/ +├── {task_id}/ +│ ├── configs/ +│ │ ├── {task_id}_executor.yaml # serialised executor config +│ │ ├── {task_id}_fn_or_script # zlib-JSON encoded task +│ │ └── {task_id}_packager # zlib-JSON encoded packager +│ └── 
scripts/{task_id}.sh          # generated sbatch/shell script +└── .tasks                            # serialised Job metadata (JSON) +``` + +`Experiment.from_id()` reads `.tasks` to reconstruct the experiment and reattach to live jobs via the stored `AppHandle`. + +--- + +## Adding a new executor + +1. **Subclass `Executor`** in `nemo_run/core/execution/`: + + ```python + from dataclasses import dataclass + + from nemo_run.core.execution.base import Executor + + @dataclass + class MyExecutor(Executor): + my_param: str = "default" + ... + ``` + +2. **Implement a TorchX `Scheduler`** in `nemo_run/run/torchx_backend/schedulers/`: + + ```python + from torchx.schedulers import Scheduler + + class MyScheduler(Scheduler): + def submit_dryrun(self, app, cfg): ... + def schedule(self, dryrun_info): ... + def describe(self, app_id): ... + def cancel(self, app_id): ... + ``` + +3. **Register the scheduler as an entry point** in `pyproject.toml`: + + ```toml + [project.entry-points."torchx.schedulers"] + my_scheduler = "nemo_run.run.torchx_backend.schedulers.my:create_scheduler" + ``` + +4. **Add to `EXECUTOR_MAPPING`** in `nemo_run/run/torchx_backend/schedulers/api.py`: + + ```python + EXECUTOR_MAPPING = { + ..., + MyExecutor: "my_scheduler", + } + ``` diff --git a/docs/guides/execution.md b/docs/guides/execution.md index 6f2c0063..12245a3d 100644 --- a/docs/guides/execution.md +++ b/docs/guides/execution.md @@ -122,216 +122,18 @@ hybrid_packager = run.HybridPackager( This would create an archive where the contents of `src` are under a `code/` directory and matched `configs/*.yaml` files are under a `configs/` directory. -### Defining Executors -Next, We'll describe details on setting up each of the executors below. 
+For per-executor prerequisites, configuration reference, and end-to-end examples see the **[Executors](executors/index.md)** section: -#### LocalExecutor +- [LocalExecutor](executors/local.md) +- [DockerExecutor](executors/docker.md) +- [SlurmExecutor](executors/slurm.md) +- [SkypilotExecutor](executors/skypilot.md) +- [DGXCloudExecutor](executors/dgxcloud.md) +- [LeptonExecutor](executors/lepton.md) +- [KubeRayExecutor](executors/kuberay.md) -The LocalExecutor is the simplest executor. It executes your task locally in a separate process or group from your current working directory. +Defining executors in Python offers great flexibility — you can mix and match common environment variables, and the separation of tasks from executors lets you run the same `run.Script` on any supported backend. -The easiest way to define one is to call `run.LocalExecutor()`. - -#### DockerExecutor - -The DockerExecutor enables launching a task using `docker` on your local machine. It requires `docker` to be installed and running as a prerequisite. - -The DockerExecutor uses the [docker python client](https://docker-py.readthedocs.io/en/stable/) and most of the options are passed directly to the client. - -Below is an example of configuring a Docker Executor - -```python -run.DockerExecutor( - container_image="python:3.12", - num_gpus=-1, - runtime="nvidia", - ipc_mode="host", - shm_size="30g", - volumes=["/local/path:/path/in/container"], - env_vars={"PYTHONUNBUFFERED": "1"}, - packager=run.Packager(), -) -``` - -#### SlurmExecutor - -The SlurmExecutor enables launching the configured task on a Slurm Cluster with Pyxis. Additionally, you can configure a `run.SSHTunnel`, which enables you to execute tasks on the Slurm cluster from your local machine while NeMo-Run manages the SSH connection for you. This setup supports use cases such as launching the same task on multiple Slurm clusters. 
- -Below is an example of configuring a Slurm Executor - -```python -def your_slurm_executor(nodes: int = 1, container_image: str = DEFAULT_IMAGE): - # SSH Tunnel - ssh_tunnel = run.SSHTunnel( - host="your-slurm-host", - user="your-user", - job_dir="directory-to-store-runs-on-the-slurm-cluster", - identity="optional-path-to-your-key-for-auth", - ) - # Local Tunnel to use if you're already on the cluster - local_tunnel = run.LocalTunnel() - - packager = GitArchivePackager( - # This will also be the working directory in your task. - # If empty, the working directory will be toplevel of your git repo - subpath="optional-subpath-from-toplevel-of-your-git-repo" - ) - - executor = run.SlurmExecutor( - # Most of these parameters are specific to slurm - account="your-account", - partition="your-partition", - ntasks_per_node=8, - gpus_per_node=8, - nodes=nodes, - tunnel=ssh_tunnel, - container_image=container_image, - time="00:30:00", - env_vars=common_envs(), - container_mounts=mounts_for_your_hubs(), - packager=packager, - ) - -# You can then call the executor in your script like -executor = your_slurm_cluster(nodes=8, container_image="your-nemo-image") -``` - -Use the SSH Tunnel when launching from your local machine, or the Local Tunnel if you're already on the Slurm cluster. - -##### Job Dependencies - -`SlurmExecutor` supports defining dependencies between [jobs](management.md#add-tasks), allowing you to create workflows where jobs run in a specific order. Additionally, you can specify the `dependency_type` parameter: - -```python -executor = run.SlurmExecutor( - # ... other parameters ... 
- dependency_type="afterok", -) -``` - -The `dependency_type` parameter specifies the type of dependency relationship: - -- `afterok` (default): Job will start only after the specified jobs have completed successfully -- `afterany`: Job will start after the specified jobs have terminated (regardless of exit code) -- `afternotok`: Job will start after the specified jobs have failed -- Other options are available as defined in the [Slurm documentation](https://slurm.schedmd.com/sbatch.html#OPT_dependency) - -This functionality enables you to create complex workflows with proper orchestration between different tasks, such as starting a training job only after data preparation is complete, or running an evaluation only after training finishes successfully. - -#### SkypilotExecutor - -This executor is used to configure [Skypilot](https://skypilot.readthedocs.io/en/latest/docs/index.html). Make sure Skypilot is installed using `pip install "nemo_run[skypilot]"` and atleast one cloud is configured using `sky check`. - -Here's an example of the `SkypilotExecutor` for Kubernetes: - -```python -def your_skypilot_executor(nodes: int, devices: int, container_image: str): - return SkypilotExecutor( - gpus="RTX5880-ADA-GENERATION", - gpus_per_node=devices, - num_nodes=nodes, - env_vars=common_envs(), - container_image=container_image, - cloud="kubernetes", - # Optional to reuse Skypilot cluster - cluster_name="tester", - setup=""" - conda deactivate - nvidia-smi - ls -al ./ - """, - ) - -# You can then call the executor in your script like -executor = your_skypilot_cluster(nodes=8, devices=8, container_image="your-nemo-image") -``` - -As demonstrated in the examples, defining executors in Python offers great flexibility. You can easily mix and match things like common environment variables, and the separation of tasks from executors enables you to run the same configured task on any supported executor. 
- -#### DGXCloudExecutor - -The `DGXCloudExecutor` integrates with a DGX Cloud cluster's Run:ai API to launch distributed jobs. It uses REST API calls to authenticate, identify the target project and cluster, and submit the job specification. - -```{warning} -Currently, the `DGXCloudExecutor` is only supported when launching experiments *from* a pod running on the DGX Cloud cluster itself. Furthermore, this launching pod must have access to a Persistent Volume Claim (PVC) where the experiment/job directories will be created, and this same PVC must also be configured to be mounted by the job being launched. -``` - -Here's an example configuration: - -```python -def your_dgx_executor(nodes: int, gpus_per_node: int, container_image: str): - # Ensure these are set correctly for your DGX Cloud environment - # You might fetch these from environment variables or a config file - base_url = "YOUR_DGX_CLOUD_API_ENDPOINT" # e.g., https://./api/v1 - app_id = "YOUR_RUNAI_APP_ID" - app_secret = "YOUR_RUNAI_APP_SECRET" - project_name = "YOUR_RUNAI_PROJECT_NAME" - # Define the PVC that will be mounted in the job pods - # Ensure the path specified here contains your NEMORUN_HOME - pvc_name = "your-pvc-k8s-name" # The Kubernetes name of the PVC - pvc_mount_path = "/your_custom_path" # The path where the PVC will be mounted inside the container - - executor = run.DGXCloudExecutor( - base_url=base_url, - app_id=app_id, - app_secret=app_secret, - project_name=project_name, - container_image=container_image, - nodes=nodes, - gpus_per_node=gpus_per_node, - pvcs=[{"name": pvc_name, "path": pvc_mount_path}], - # Optional: Add custom environment variables or Slurm specs if needed - env_vars=common_envs(), - # packager=run.GitArchivePackager() # Choose appropriate packager - ) - return executor - -# Example usage: -# executor = your_dgx_executor(nodes=4, gpus_per_node=8, container_image="your-nemo-image") - -``` - -For a complete end-to-end example using DGX Cloud with NeMo, refer to the 
[NVIDIA DGX Cloud NeMo End-to-End Workflow Example](https://docs.nvidia.com/dgx-cloud/run-ai/latest/nemo-e2e-example.html). - -#### LeptonExecutor - -The `LeptonExecutor` integrates with an NVIDIA DGX Cloud Lepton cluster's Python SDK to launch distributed jobs. It uses API calls behind the Lepton SDK to authenticate, identify the target node group and resource shapes, and submit the job specification which will be launched as a batch job on the cluster. - -Here's an example configuration: - -```python -def your_lepton_executor(nodes: int, gpus_per_node: int, container_image: str): - # Ensure these are set correctly for your DGX Cloud environment - # You might fetch these from environment variables or a config file - resource_shape = "gpu.8xh100-80gb" # Replace with your desired resource shape representing the number of GPUs in a pod - node_group = "my-node-group" # The node group to run the job in - nemo_run_dir = "/nemo-workspace/nemo-run" # The NeMo-Run directory where experiments are saved - # Define the remote storage directory that will be mounted in the job pods - # Ensure the path specified here contains your NEMORUN_HOME - storage_path = "/nemo-workspace" # The remote storage directory to mount in jobs - mount_path = "/nemo-workspace" # The path where the remote storage directory will be mounted inside the container - - executor = run.LeptonExecutor( - resource_shape=resource_shape, - node_group=node_group, - container_image=container_image, - nodes=nodes, - nemo_run_dir=nemo_run_dir, - gpus_per_node=gpus_per_node, - mounts=[{"path": storage_path, "mount_path": mount_path}], - # Optional: Add custom environment variables or PyTorch specs if needed - env_vars=common_envs(), - # Optional: Specify a node reservation to schedule jobs with - # node_reservation="my-node-reservation", - # Optional: Specify commands to run at container launch prior to the job starting - # pre_launch_commands=["nvidia-smi"], - # Optional: Specify image pull secrets for 
authenticating with container registries - # image_pull_secrets=["my-image-pull-secret"], - # packager=run.GitArchivePackager() # Choose appropriate packager - ) - return executor - -# Example usage: -executor = your_lepton_executor(nodes=4, gpus_per_node=8, container_image="your-nemo-image") - -``` +For a deep dive into how executors map to TorchX schedulers and how `run.Experiment` orchestrates execution, see [Architecture](architecture.md). diff --git a/docs/guides/executors/dgxcloud.md b/docs/guides/executors/dgxcloud.md new file mode 100644 index 00000000..a866207e --- /dev/null +++ b/docs/guides/executors/dgxcloud.md @@ -0,0 +1,89 @@ +# DGXCloudExecutor + +Launch distributed jobs on NVIDIA DGX Cloud via the Run:ai API. + +## Prerequisites + +```{warning} +`DGXCloudExecutor` is currently only supported when launching experiments *from a pod running on the DGX Cloud cluster itself*. The launching pod must have access to a Persistent Volume Claim (PVC), and the same PVC must be mounted by the launched job. 
+``` + +You need: +- Access to a DGX Cloud cluster and a Run:ai project +- A Run:ai application ID and secret (create one in the Run:ai console under **Application credentials**) +- A PVC accessible from both the launching pod and the job pods +- `NEMORUN_HOME` pointing to a path on the PVC + +## Executor configuration + +```python +import nemo_run as run + +executor = run.DGXCloudExecutor( + base_url="https://./api/v1", # Run:ai API endpoint + app_id="YOUR_RUNAI_APP_ID", + app_secret="YOUR_RUNAI_APP_SECRET", + project_name="YOUR_RUNAI_PROJECT_NAME", + container_image="nvcr.io/nvidia/pytorch:24.05-py3", + nodes=1, + gpus_per_node=8, + pvcs=[{ + "name": "your-pvc-k8s-name", # Kubernetes PVC name + "path": "/your_custom_path", # mount path inside the container + }], + env_vars={"PYTHONUNBUFFERED": "1"}, +) +``` + +Key parameters: + +| Parameter | Description | +|-----------|-------------| +| `base_url` | Run:ai REST API base URL | +| `app_id` / `app_secret` | Run:ai application credentials | +| `project_name` | Run:ai project to submit jobs to | +| `container_image` | Container image URI | +| `nodes` | Number of nodes | +| `gpus_per_node` | GPUs per node | +| `pvcs` | List of PVC mounts (`name` + `path`) | + +## E2E workflow + +```python +import nemo_run as run + +task = run.Script("python train.py --lr=3e-4 --max-steps=500") + +executor = run.DGXCloudExecutor( + base_url="https://my-cluster.example.com/api/v1", + app_id="my-app-id", + app_secret="my-app-secret", + project_name="my-project", + container_image="nvcr.io/nvidia/pytorch:24.05-py3", + nodes=1, + gpus_per_node=8, + pvcs=[{"name": "my-pvc", "path": "/workspace"}], +) + +with run.Experiment("my-experiment") as exp: + exp.add(task, executor=executor, name="training") + exp.run(detach=True) + +# Later — reconnect and check status +experiment = run.Experiment.from_id("my-experiment_") +experiment.status() +experiment.logs("training") +``` + +## Advanced options + +### Package code from git + +```python 
+executor = run.DGXCloudExecutor( + ..., + packager=run.GitArchivePackager(subpath="src"), +) +``` + +For a complete end-to-end NeMo example see the [NVIDIA DGX Cloud NeMo E2E Workflow](https://docs.nvidia.com/dgx-cloud/run-ai/latest/nemo-e2e-example.html). diff --git a/docs/guides/executors/docker.md b/docs/guides/executors/docker.md new file mode 100644 index 00000000..149fc506 --- /dev/null +++ b/docs/guides/executors/docker.md @@ -0,0 +1,88 @@ +# DockerExecutor + +Run tasks inside a Docker container on your local machine. + +## Prerequisites + +- Docker Engine installed and running (`docker info` should succeed) +- The `docker` Python package (installed automatically with NeMo-Run) + +## Executor configuration + +```python +import nemo_run as run + +executor = run.DockerExecutor( + container_image="python:3.12", # any accessible image + num_gpus=-1, # -1 = all GPUs; 0 = CPU-only + runtime="nvidia", # omit for CPU-only workloads + ipc_mode="host", + shm_size="30g", + volumes=["/local/path:/path/in/container"], + env_vars={"PYTHONUNBUFFERED": "1"}, + packager=run.Packager(), # passthrough packager +) +``` + +Key parameters: + +| Parameter | Description | +|-----------|-------------| +| `container_image` | Docker image to use (required) | +| `num_gpus` | Number of GPUs to expose; `-1` = all | +| `runtime` | Container runtime (`"nvidia"` for GPU support) | +| `ipc_mode` | IPC namespace mode (`"host"` for multi-GPU NCCL) | +| `shm_size` | Shared memory size | +| `volumes` | Host–container path bindings | +| `packager` | How to sync code into the container | + +## E2E workflow + +```python +import nemo_run as run + +task = run.Script("python train.py --lr=3e-4 --max-steps=500") + +executor = run.DockerExecutor( + container_image="python:3.12", + packager=run.Packager(), +) + +with run.Experiment("my-experiment") as exp: + exp.add(task, executor=executor, name="training") + exp.run(detach=False) + +exp.status() +exp.logs("training") +``` + +## Advanced options + +### 
Package your code into the container + +Use `GitArchivePackager` to bundle committed code from your repo: + +```python +executor = run.DockerExecutor( + container_image="nvcr.io/nvidia/pytorch:24.05-py3", + packager=run.GitArchivePackager(subpath="src"), + num_gpus=-1, + runtime="nvidia", +) +``` + +The packaged archive is mounted at the working directory inside the container. + +### Torchrun for multi-GPU jobs + +```python +executor = run.DockerExecutor( + container_image="nvcr.io/nvidia/pytorch:24.05-py3", + num_gpus=-1, + runtime="nvidia", + ipc_mode="host", + shm_size="16g", + launcher="torchrun", + ntasks_per_node=8, +) +``` diff --git a/docs/guides/executors/index.md b/docs/guides/executors/index.md new file mode 100644 index 00000000..9c32f2fa --- /dev/null +++ b/docs/guides/executors/index.md @@ -0,0 +1,58 @@ +# Executors + +:::{toctree} +:maxdepth: 1 +:hidden: + +local +docker +slurm +skypilot +dgxcloud +lepton +kuberay +::: + +An **execution unit** is a (task, executor) pair. The task defines *what* to run; the executor defines *where and how*. NeMo-Run keeps these two concerns separate so you can swap executors without changing your task configuration. 
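To make the separation concrete, here is a plain-Python toy model of an execution unit. The classes are illustrative stand-ins only, not the real NeMo-Run API:

```python
from dataclasses import dataclass, field

# Illustrative stand-ins only; NOT the real NeMo-Run classes.
@dataclass(frozen=True)
class Script:
    command: str  # *what* to run

@dataclass
class Executor:
    backend: str  # *where and how* to run it
    env_vars: dict = field(default_factory=dict)

task = Script("python train.py --max-steps=100")

# The same task object pairs with any executor; only the environment changes.
units = [(task, Executor("local")), (task, Executor("docker"))]
assert units[0][0] is units[1][0]
```

Because the task never references its executor, swapping backends is a one-line change to the pairing, not to the task.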
+ +## Choose an executor + +Pick the executor that matches your environment: + +| Executor | When to use | Setup cost | +|----------|-------------|-----------| +| [LocalExecutor](local.md) | Prototyping, debugging, CI | None — works out of the box | +| [DockerExecutor](docker.md) | Reproducible local runs, container-based workflows | Docker installed & running | +| [SlurmExecutor](slurm.md) | HPC clusters with Slurm and Pyxis | SSH access to a Slurm cluster | +| [SkypilotExecutor](skypilot.md) | Multi-cloud: AWS, GCP, Azure, Kubernetes | `pip install nemo_run[skypilot]` + cloud credentials | +| [DGXCloudExecutor](dgxcloud.md) | NVIDIA DGX Cloud via Run:ai | Pod access + PVC on DGX Cloud | +| [LeptonExecutor](lepton.md) | NVIDIA DGX Cloud Lepton (standard execution) | Lepton CLI installed & authenticated | +| [KubeRayExecutor](kuberay.md) | Ray workloads on Kubernetes | kubectl + KubeRay operator | + +## Packager support matrix + +The packager controls how your code is bundled and sent to the execution environment. + +| Executor | Packagers | +|----------|-----------| +| LocalExecutor | `run.Packager` (passthrough) | +| DockerExecutor | `run.Packager`, `run.GitArchivePackager`, `run.PatternPackager`, `run.HybridPackager` | +| SlurmExecutor | `run.Packager`, `run.GitArchivePackager`, `run.PatternPackager`, `run.HybridPackager` | +| SkypilotExecutor | `run.Packager`, `run.GitArchivePackager`, `run.PatternPackager`, `run.HybridPackager` | +| DGXCloudExecutor | `run.Packager`, `run.GitArchivePackager`, `run.PatternPackager`, `run.HybridPackager` | +| LeptonExecutor | `run.Packager`, `run.GitArchivePackager`, `run.PatternPackager`, `run.HybridPackager` | + +See [Execution — Packagers](../execution.md#packagers) for a description of each packager. + +## Launcher support + +The launcher controls how the process is started inside the executor. 
+ +| Launcher | Flag | Description | +|----------|------|-------------| +| Default | `None` | Direct subprocess — no special launcher | +| Torchrun | `"torchrun"` / `run.Torchrun(...)` | Distributed training via `torchrun` | +| Fault Tolerance | `"ft"` / `run.core.execution.FaultTolerance(...)` | NVIDIA fault-tolerant launcher | +| SlurmRay | `"slurm_ray"` | Ray cluster on Slurm (see [ray.md](../ray.md)) | + +See [Execution — Launchers](../execution.md#launchers) for details. diff --git a/docs/guides/executors/kuberay.md b/docs/guides/executors/kuberay.md new file mode 100644 index 00000000..c2a38341 --- /dev/null +++ b/docs/guides/executors/kuberay.md @@ -0,0 +1,129 @@ +# KubeRayExecutor + +Configure Ray clusters and jobs on Kubernetes via the [KubeRay operator](https://ray-project.github.io/kuberay/). + +```{note} +`KubeRayExecutor` is not used directly with `run.Experiment`. It is passed to `RayCluster` and `RayJob` helpers. For the full Ray workflow see [Ray Clusters & Jobs](../ray.md). +``` + +## Prerequisites + +- `kubectl` configured with access to your Kubernetes cluster (`kubectl cluster-info` should succeed) +- KubeRay operator installed in the cluster +- A container image with Ray installed (e.g. 
`anyscale/ray:2.43.0-py312-cu125`) + +## Executor configuration + +```python +from nemo_run.core.execution.kuberay import KubeRayExecutor, KubeRayWorkerGroup + +executor = KubeRayExecutor( + namespace="my-k8s-namespace", + ray_version="2.43.0", + image="anyscale/ray:2.43.0-py312-cu125", + head_cpu="4", + head_memory="12Gi", + worker_groups=[ + KubeRayWorkerGroup( + group_name="worker", + replicas=2, + gpus_per_worker=8, + ) + ], + env_vars={ + "HF_HOME": "/workspace/hf_cache", + }, +) +``` + +Key parameters: + +| Parameter | Description | +|-----------|-------------| +| `namespace` | Kubernetes namespace for Ray resources | +| `ray_version` | Ray version string (must match the image) | +| `image` | Ray container image | +| `head_cpu` / `head_memory` | Resources for the head pod | +| `worker_groups` | List of `KubeRayWorkerGroup` definitions | + +`KubeRayWorkerGroup` parameters: + +| Parameter | Description | +|-----------|-------------| +| `group_name` | Arbitrary name for the worker group | +| `replicas` | Number of worker pods | +| `gpus_per_worker` | GPUs per worker pod | + +## E2E workflow + +Use `KubeRayExecutor` with `RayCluster` and `RayJob` from `nemo_run.run.ray`: + +```python +from nemo_run.core.execution.kuberay import KubeRayExecutor, KubeRayWorkerGroup +from nemo_run.run.ray.cluster import RayCluster +from nemo_run.run.ray.job import RayJob + +executor = KubeRayExecutor( + namespace="ml-team", + ray_version="2.43.0", + image="anyscale/ray:2.43.0-py312-cu125", + worker_groups=[ + KubeRayWorkerGroup(group_name="worker", replicas=2, gpus_per_worker=8), + ], +) + +# 1. Start the cluster +cluster = RayCluster(name="my-kuberay-cluster", executor=executor) +cluster.start(timeout=900) +cluster.port_forward(port=8265, target_port=8265, wait=False) # dashboard + +# 2. Submit a job +job = RayJob(name="my-job", executor=executor) +job.start( + command="python train.py --config cfgs/train.yaml", + workdir="/path/to/project/", +) +job.logs(follow=True) + +# 3. 
Clean up +cluster.stop() +``` + +## Advanced options + +### Persistent volume mounts + +```python +executor = KubeRayExecutor( + ..., + volume_mounts=[{"name": "workspace", "mountPath": "/workspace"}], + volumes=[{ + "name": "workspace", + "persistentVolumeClaim": {"claimName": "my-workspace-pvc"}, + }], + reuse_volumes_in_worker_groups=True, # also mount PVCs on workers +) +``` + +### Custom scheduler (e.g. Run:ai) + +```python +executor = KubeRayExecutor( + ..., + spec_kwargs={"schedulerName": "runai-scheduler"}, +) +``` + +### Pre-Ray commands + +Commands injected into head and worker containers before Ray starts: + +```python +cluster.start( + timeout=900, + pre_ray_start_commands=[ + "pip install uv", + "echo 'unset RAY_RUNTIME_ENV_HOOK' >> /home/ray/.bashrc", + ], +) +``` diff --git a/docs/guides/executors/lepton.md b/docs/guides/executors/lepton.md new file mode 100644 index 00000000..05703b73 --- /dev/null +++ b/docs/guides/executors/lepton.md @@ -0,0 +1,105 @@ +# LeptonExecutor + +Launch distributed batch jobs on NVIDIA DGX Cloud Lepton. + +## Prerequisites + +- [DGX Cloud Lepton CLI](https://docs.nvidia.com/dgx-cloud/lepton/reference/cli/get-started/) installed and authenticated (`lep workspace info` should return your workspace) +- A node group with sufficient GPU capacity +- A remote storage mount accessible from the job pods + +```{note} +For Ray workloads on Lepton (e.g. `RayCluster` / `RayJob`), see [Ray Clusters & Jobs](../ray.md) instead. 
+``` + +## Executor configuration + +```python +import nemo_run as run + +executor = run.LeptonExecutor( + resource_shape="gpu.8xh100-80gb", # resource shape = GPUs per pod + node_group="my-node-group", + container_image="nvcr.io/nvidia/pytorch:24.05-py3", + nodes=1, + gpus_per_node=8, + nemo_run_dir="/nemo-workspace/nemo-run", # path on remote storage for NeMo-Run metadata + mounts=[{ + "path": "/nemo-workspace", # remote storage path + "mount_path": "/nemo-workspace", # container mount point + }], + env_vars={"PYTHONUNBUFFERED": "1"}, +) +``` + +Key parameters: + +| Parameter | Description | +|-----------|-------------| +| `resource_shape` | Resource shape string (encodes GPU count per pod) | +| `node_group` | Lepton node group to schedule on | +| `container_image` | Container image URI | +| `nodes` | Number of pods | +| `gpus_per_node` | GPUs per pod | +| `nemo_run_dir` | Directory on remote storage where NeMo-Run saves experiment metadata | +| `mounts` | Remote storage mounts (`path` + `mount_path`) | + +## E2E workflow + +```python +import nemo_run as run + +task = run.Script("python train.py --lr=3e-4 --max-steps=500") + +executor = run.LeptonExecutor( + resource_shape="gpu.8xh100-80gb", + node_group="my-node-group", + container_image="nvcr.io/nvidia/pytorch:24.05-py3", + nodes=1, + gpus_per_node=8, + nemo_run_dir="/nemo-workspace/nemo-run", + mounts=[{"path": "/nemo-workspace", "mount_path": "/nemo-workspace"}], +) + +with run.Experiment("my-experiment") as exp: + exp.add(task, executor=executor, name="training") + exp.run(detach=True) + +# Later — reconnect and check status +experiment = run.Experiment.from_id("my-experiment_") +experiment.status() +experiment.logs("training") +``` + +## Advanced options + +### Node reservation + +Pin the job to a specific reserved node group: + +```python +executor = run.LeptonExecutor( + ..., + node_reservation="my-node-reservation", +) +``` + +### Pre-launch commands + +Run shell commands inside the container before the 
job starts:
+
+```python
+executor = run.LeptonExecutor(
+    ...,
+    pre_launch_commands=["nvidia-smi", "pip install --upgrade my-package"],
+)
+```
+
+### Private registry images
+
+```python
+executor = run.LeptonExecutor(
+    ...,
+    image_pull_secrets=["my-registry-secret"],
+)
+```
diff --git a/docs/guides/executors/local.md b/docs/guides/executors/local.md
new file mode 100644
index 00000000..0c1b38e8
--- /dev/null
+++ b/docs/guides/executors/local.md
@@ -0,0 +1,63 @@
+# LocalExecutor
+
+Run tasks directly on your local machine in a separate subprocess.
+
+## Prerequisites
+
+None. `LocalExecutor` works out of the box with a standard NeMo-Run installation.
+
+## Executor configuration
+
+```python
+import nemo_run as run
+
+executor = run.LocalExecutor()
+```
+
+`LocalExecutor` has no required parameters. Optional fields mirror the base `Executor`:
+
+- `env_vars` — extra environment variables passed to the subprocess
+- `launcher` — optional launcher (`"torchrun"`, `"ft"`, or `None`)
+- `ntasks_per_node` — number of tasks to launch per node (default: `1`)
+
+## E2E workflow
+
+```python
+import nemo_run as run
+
+task = run.Script("python train.py --lr=3e-4 --max-steps=500")
+executor = run.LocalExecutor()
+
+with run.Experiment("my-experiment") as exp:
+    exp.add(task, executor=executor, name="training")
+    exp.run(detach=False)
+
+exp.status()
+exp.logs("training")
+```
+
+## Advanced options
+
+### Torchrun launcher
+
+Use `torchrun` for multi-GPU distributed training:
+
+```python
+executor = run.LocalExecutor(
+    launcher="torchrun",
+    ntasks_per_node=4,  # number of GPUs
+    env_vars={"NCCL_DEBUG": "INFO"},
+)
+```
+
+### Inline script
+
+Pass a Python snippet directly without creating a file:
+
+```python
+task = run.Script(inline="import socket; print(socket.gethostname())")
+
+with run.Experiment("inline-experiment") as exp:
+    exp.add(task, executor=run.LocalExecutor(), name="hostname")
+    exp.run(detach=False)
+```
diff --git a/docs/guides/executors/skypilot.md b/docs/guides/executors/skypilot.md
new file mode 100644
index 00000000..a13d8bb5
--- /dev/null
+++ b/docs/guides/executors/skypilot.md
@@ -0,0 +1,100 @@
+# SkypilotExecutor
+
+Launch tasks across clouds (AWS, GCP, Azure, Kubernetes, and more) via [SkyPilot](https://skypilot.readthedocs.io/).
+
+## Prerequisites
+
+1. Install the SkyPilot extras:
+
+   ```bash
+   pip install "nemo_run[skypilot]"
+   ```
+
+2. Configure at least one cloud with `sky check`. Follow the [SkyPilot cloud setup guide](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html) for your provider.
+
+## Executor configuration
+
+```python
+from nemo_run.core.execution.skypilot import SkypilotExecutor
+
+executor = SkypilotExecutor(
+    cloud="kubernetes",  # or "aws", "gcp", "azure", …
+    gpus="A100",  # GPU type string recognised by SkyPilot
+    gpus_per_node=8,
+    num_nodes=1,
+    container_image="nvcr.io/nvidia/pytorch:24.05-py3",
+    env_vars={"PYTHONUNBUFFERED": "1"},
+    # Optional: reuse an existing cluster instead of provisioning a new one
+    cluster_name="my-sky-cluster",
+    setup="""
+    conda deactivate
+    nvidia-smi
+    """,
+)
+```
+
+Key parameters:
+
+| Parameter | Description |
+|-----------|-------------|
+| `cloud` | Cloud provider or `"kubernetes"` |
+| `gpus` | GPU type string (e.g. `"A100"`, `"H100"`) |
+| `gpus_per_node` | GPUs per node |
+| `num_nodes` | Number of nodes |
+| `container_image` | Docker image for the job |
+| `cluster_name` | Optional: name of an existing cluster to reuse |
+| `setup` | Shell commands to run once on the cluster before the job |
+
+## E2E workflow
+
+```python
+import nemo_run as run
+from nemo_run.core.execution.skypilot import SkypilotExecutor
+
+task = run.Script("python train.py --lr=3e-4 --max-steps=500")
+
+executor = SkypilotExecutor(
+    cloud="kubernetes",
+    gpus="RTX5880-ADA-GENERATION",
+    gpus_per_node=8,
+    num_nodes=1,
+    container_image="nvcr.io/nvidia/pytorch:24.05-py3",
+)
+
+with run.Experiment("my-experiment") as exp:
+    exp.add(task, executor=executor, name="training")
+    exp.run(detach=True)
+
+# Later — reconnect and check status
+experiment = run.Experiment.from_id("my-experiment_<timestamp>")
+experiment.status()
+experiment.logs("training")
+```
+
+## Advanced options
+
+### SkypilotJobsExecutor (managed jobs)
+
+`SkypilotJobsExecutor` submits [SkyPilot Managed Jobs](https://skypilot.readthedocs.io/en/latest/running-jobs/managed-jobs.html), which survive controller failures and support spot instances with auto-recovery:
+
+```python
+from nemo_run.core.execution.skypilot import SkypilotJobsExecutor
+
+executor = SkypilotJobsExecutor(
+    cloud="aws",
+    gpus="A100",
+    gpus_per_node=8,
+    num_nodes=4,
+    container_image="nvcr.io/nvidia/pytorch:24.05-py3",
+    use_spot=True,
+)
+```
+
+### Package code from git
+
+```python
+executor = SkypilotExecutor(
+    ...,
+    packager=run.GitArchivePackager(subpath="src"),
+)
+```
diff --git a/docs/guides/executors/slurm.md b/docs/guides/executors/slurm.md
new file mode 100644
index 00000000..c49d3c20
--- /dev/null
+++ b/docs/guides/executors/slurm.md
@@ -0,0 +1,138 @@
+# SlurmExecutor
+
+Launch tasks on a Slurm HPC cluster, optionally from your local machine over SSH.
+
+## Prerequisites
+
+- Access to a Slurm cluster with Pyxis installed
+- SSH key authentication set up (for remote launch via `SSHTunnel`)
+- A container image accessible from the cluster (e.g. on a shared registry or pulled to the nodes)
+
+## Executor configuration
+
+```python
+import nemo_run as run
+from nemo_run import GitArchivePackager
+
+# Connect to the cluster over SSH (omit if you're already on the cluster)
+ssh_tunnel = run.SSHTunnel(
+    host="login.my-cluster.com",
+    user="your-username",
+    job_dir="/scratch/your-username/nemo-runs",  # where NeMo-Run stores metadata on the cluster
+    identity="~/.ssh/id_ed25519",  # optional SSH key path
+)
+
+executor = run.SlurmExecutor(
+    account="your-account",
+    partition="your-partition",
+    nodes=1,
+    ntasks_per_node=8,
+    gpus_per_node=8,
+    container_image="nvcr.io/nvidia/pytorch:24.05-py3",
+    time="00:30:00",
+    tunnel=ssh_tunnel,
+    packager=GitArchivePackager(subpath="src"),  # optional: package code from git
+    env_vars={"PYTHONUNBUFFERED": "1"},
+)
+```
+
+Use `run.LocalTunnel()` instead of `SSHTunnel` when launching from a login node directly.
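The executor fields above correspond closely to familiar `#SBATCH` directives. As a rough, stdlib-only illustration of that mapping (NeMo-Run generates the real batch script internally via its Slurm scheduler, so the exact directives and ordering may differ):

```python
# Illustrative sketch only: shows roughly how SlurmExecutor fields map to
# #SBATCH directives. NeMo-Run builds the actual script itself.
def sbatch_preamble(account, partition, nodes, ntasks_per_node, gpus_per_node, time):
    directives = {
        "account": account,
        "partition": partition,
        "nodes": nodes,
        "ntasks-per-node": ntasks_per_node,
        "gpus-per-node": gpus_per_node,
        "time": time,
    }
    return "\n".join(f"#SBATCH --{key}={value}" for key, value in directives.items())

print(sbatch_preamble("your-account", "your-partition", 1, 8, 8, "00:30:00"))
```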
+
+Key parameters:
+
+| Parameter | Description |
+|-----------|-------------|
+| `account` | Slurm account / project to charge |
+| `partition` | Target partition |
+| `nodes` | Number of nodes |
+| `ntasks_per_node` | Processes per node (usually equals GPU count) |
+| `gpus_per_node` | GPUs per node |
+| `container_image` | Container image URI |
+| `time` | Wall-time limit (`"HH:MM:SS"`) |
+| `tunnel` | `SSHTunnel` (remote) or `LocalTunnel` (on-cluster) |
+| `packager` | Code packaging strategy |
+
+## E2E workflow
+
+```python
+import nemo_run as run
+
+task = run.Script("python train.py --lr=3e-4 --max-steps=500")
+
+executor = run.SlurmExecutor(
+    account="my-account",
+    partition="a100",
+    nodes=1,
+    ntasks_per_node=8,
+    gpus_per_node=8,
+    container_image="nvcr.io/nvidia/pytorch:24.05-py3",
+    time="01:00:00",
+    tunnel=run.SSHTunnel(
+        host="login.my-cluster.com",
+        user="myuser",
+        job_dir="/scratch/myuser/runs",
+    ),
+)
+
+with run.Experiment("my-experiment") as exp:
+    exp.add(task, executor=executor, name="training")
+    exp.run(detach=True)  # detach=True: returns after scheduling the Slurm job
+
+# Later — reconnect and check status
+experiment = run.Experiment.from_id("my-experiment_<timestamp>")
+experiment.status()
+experiment.logs("training")
+```
+
+## Advanced options
+
+### Job dependencies
+
+Chain jobs so that the second only starts after the first succeeds:
+
+```python
+with run.Experiment("pipeline") as exp:
+    prep_id = exp.add(data_prep_task, executor=executor, name="data-prep")
+    exp.add(
+        train_task,
+        executor=run.SlurmExecutor(
+            dependency_type="afterok",  # start only after prep succeeds
+            **executor_kwargs,
+        ),
+        name="training",
+        dependencies=[prep_id],
+    )
+    exp.run(detach=True)
+```
+
+`dependency_type` options: `"afterok"` (default), `"afterany"`, `"afternotok"`. See the [Slurm documentation](https://slurm.schedmd.com/sbatch.html#OPT_dependency) for the full list.
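Under the hood, dependency options like these are expressed through Slurm's `--dependency` flag, which names a dependency type and the job IDs it waits on. A minimal sketch of how such a flag string is assembled (for illustration; NeMo-Run constructs this internally when it submits the dependent job):

```python
# Sketch: build a Slurm --dependency flag from a dependency type and the
# Slurm job IDs the new job should wait on, e.g. --dependency=afterok:123456.
def dependency_flag(dependency_type: str, job_ids: list[int]) -> str:
    if dependency_type not in {"afterok", "afterany", "afternotok"}:
        raise ValueError(f"unsupported dependency type: {dependency_type}")
    return f"--dependency={dependency_type}:" + ":".join(str(j) for j in job_ids)

print(dependency_flag("afterok", [123456]))  # → --dependency=afterok:123456
```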
+
+### Torchrun launcher
+
+```python
+executor = run.SlurmExecutor(
+    ...,
+    launcher="torchrun",
+    ntasks_per_node=8,
+)
+```
+
+### Custom stdout/stderr paths
+
+Subclass `SlurmJobDetails` to redirect Slurm logs:
+
+```python
+from pathlib import Path
+from nemo_run.core.execution.slurm import SlurmJobDetails
+
+class MyJobDetails(SlurmJobDetails):
+    @property
+    def stdout(self) -> Path:
+        return Path(self.folder) / "job.out"
+
+    @property
+    def stderr(self) -> Path:
+        return Path(self.folder) / "job.err"
+
+executor.job_details = MyJobDetails()
+```
diff --git a/docs/guides/index.md b/docs/guides/index.md
index f8d5fcf1..b9bb38d2 100644
--- a/docs/guides/index.md
+++ b/docs/guides/index.md
@@ -6,37 +6,39 @@
 :hidden:
 
 why-use-nemo-run
+quickstart
 configuration
 execution
+executors/index
 management
-ray
 cli
+ray
+architecture
 :::
 
 Welcome to the NeMo-Run guides! This section provides comprehensive documentation on how to use NeMo-Run effectively for your machine learning experiments.
 
 ## Get Started
 
-If you're new to NeMo-Run, we recommend starting with:
+If you're new to NeMo-Run, follow the guides in this order:
 
-- **[Why Use NeMo-Run?](why-use-nemo-run.md)** - Understand the benefits and philosophy behind NeMo-Run.
-- **[Configuration](configuration.md)** - Learn how to configure your ML tasks and experiments.
-- **[Execution](execution.md)** - Discover how to run your experiments across different computing environments.
-- **[Management](management.md)** - Master experiment tracking, reproducibility, and organization.
+1. **[Why Use NeMo-Run?](why-use-nemo-run.md)** — Understand the benefits and philosophy.
+2. **[Quickstart](quickstart.md)** — Get something running in 5 minutes.
+3. **[Configuration](configuration.md)** — Learn how to configure tasks and experiments.
+4. **[Execution](execution.md)** — Understand executors, packagers, and launchers.
+5. **[Executors](executors/index.md)** — Per-executor guides from local to cloud.
+6. **[Management](management.md)** — Track, inspect, and reproduce past experiments.
 
 ## Advanced Topics
 
-For more advanced usage:
-
-- **[Ray Integration](ray.md)** - Learn how to use NeMo-Run with Ray for distributed computing.
-- **[CLI Reference](cli.md)** - Explore the command-line interface for NeMo-Run.
+- **[CLI Reference](cli.md)** — Automate experiment management from the command line.
+- **[Ray Integration](ray.md)** — Distributed Ray workloads on Kubernetes, Slurm, and Lepton.
+- **[Architecture](architecture.md)** — Internals for contributors and power users.
 
 ## Core Concepts
 
 NeMo-Run is built around three core responsibilities:
 
-1. **Configuration** - Define your ML experiments using a flexible, Pythonic configuration system.
-1. **Execution** - Run your experiments seamlessly across local machines, Slurm clusters, cloud providers, and more.
-1. **Management** - Track, reproduce, and organize your experiments with built-in experiment management.
-
-Each guide dives deep into these concepts with practical examples and best practices. Choose a guide above to get started!
+1. **Configuration** — Define ML experiments using a flexible, Pythonic configuration system.
+2. **Execution** — Run experiments seamlessly across local machines, Slurm clusters, cloud providers, and more.
+3. **Management** — Track, reproduce, and organize experiments with built-in experiment management.
diff --git a/docs/guides/management.md b/docs/guides/management.md
index 66a4f25e..9f578702 100644
--- a/docs/guides/management.md
+++ b/docs/guides/management.md
@@ -138,3 +138,50 @@ nemorun experiment cancel experiment_with_scripts_1720556256 0
 This information is specific to each experiment on how to manage it.
 
 See [this notebook](https://github.com/NVIDIA-NeMo/Run/blob/main/examples/hello-world/hello_experiments.ipynb) for more details and a playable experience.
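Experiment IDs like `experiment_with_scripts_1720556256` in the CLI examples above are the experiment title plus a Unix-timestamp suffix. A small stdlib sketch that recovers both parts (the `{title}_{timestamp}` convention here is inferred from the examples, not a documented API):

```python
from datetime import datetime, timezone

# Assumption for illustration: an experiment ID looks like
# "{title}_{unix_timestamp}", as in the CLI examples above.
# Split on the last underscore to recover the title and start time.
def parse_experiment_id(exp_id: str) -> tuple[str, datetime]:
    title, _, ts = exp_id.rpartition("_")
    return title, datetime.fromtimestamp(int(ts), tz=timezone.utc)

title, started = parse_experiment_id("experiment_with_scripts_1720556256")
print(title)    # → experiment_with_scripts
print(started)  # start time as a UTC datetime
```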
+
+---
+
+## Putting it all together
+
+This end-to-end example combines configuration, a remote executor, and management into a single workflow.
+
+```python
+import nemo_run as run
+
+# 1. Define the task as a shell script
+task = run.Script("python train.py --lr=3e-4 --max-steps=1000")
+
+# 2. Configure a remote executor (Slurm shown; swap for any other executor)
+executor = run.SlurmExecutor(
+    account="my-account",
+    partition="a100",
+    nodes=1,
+    ntasks_per_node=8,
+    gpus_per_node=8,
+    container_image="nvcr.io/nvidia/pytorch:24.05-py3",
+    time="02:00:00",
+    tunnel=run.SSHTunnel(
+        host="login.my-cluster.com",
+        user="myuser",
+        job_dir="/scratch/myuser/nemo-runs",
+    ),
+)
+
+# 3. Launch the experiment and detach (returns after scheduling)
+with run.Experiment("pretrain-run") as exp:
+    exp.add(task, executor=executor, name="training")
+    exp.run(detach=True)
+
+# 4. Reconnect at any later time using the printed experiment ID
+experiment = run.Experiment.from_id("pretrain-run_<timestamp>")
+experiment.status()              # overall status
+experiment.logs("training")      # stream logs
+# experiment.cancel("training")  # cancel if needed
+```
+
+```bash
+# The same operations from the CLI:
+nemorun experiment status pretrain-run_<timestamp>
+nemorun experiment logs pretrain-run_<timestamp> 0
+nemorun experiment cancel pretrain-run_<timestamp> 0
+```
diff --git a/docs/guides/quickstart.md b/docs/guides/quickstart.md
new file mode 100644
index 00000000..2cb4aff3
--- /dev/null
+++ b/docs/guides/quickstart.md
@@ -0,0 +1,61 @@
+# Quickstart
+
+Get NeMo-Run working in under 5 minutes — no cluster, no SSH, no Docker.
+
+## Install
+
+```bash
+pip install nemo_run
+```
+
+## Define a task
+
+A task is a `run.Script` wrapping a shell command or an inline Python snippet:
+
+```python
+import nemo_run as run
+
+# Shell command
+task = run.Script("python train.py --lr=3e-4 --max-steps=500")
+
+# Or an inline script string
+task = run.Script(inline="print('Training with lr=3e-4, max_steps=500')")
+```
+
+## Run locally
+
+Use `run.LocalExecutor` to run the task in a subprocess on your machine:
+
+```python
+executor = run.LocalExecutor()
+
+with run.Experiment("my-first-experiment") as exp:
+    exp.add(task, executor=executor, name="training")
+    exp.run(detach=False)
+```
+
+`detach=False` blocks until all tasks finish and streams logs to your terminal.
+
+## Inspect the result
+
+After the experiment finishes, NeMo-Run prints a snippet you can use later:
+
+```python
+experiment = run.Experiment.from_id("my-first-experiment_<timestamp>")
+experiment.status()
+experiment.logs("training")
+```
+
+Replace `<timestamp>` with the timestamp printed when the experiment ran.
+
+---
+
+## What's next
+
+| Topic | Guide |
+|-------|-------|
+| Configuring tasks with `Script` and `Config` / `Partial` | [Configuration](configuration.md) |
+| Packagers, launchers, and the executor concept | [Execution](execution.md) |
+| Run on a Docker container | [Docker executor](executors/docker.md) |
+| Run on a Slurm cluster | [Slurm executor](executors/slurm.md) |
+| Track and reproduce past experiments | [Management](management.md) |
diff --git a/docs/guides/ray.md b/docs/guides/ray.md
index 7fdff229..e5657732 100644
--- a/docs/guides/ray.md
+++ b/docs/guides/ray.md
@@ -1,9 +1,37 @@
 # Ray Clusters & Jobs
 
-> **Audience**: You already know how to configure executors with NeMo-Run and want distributed *Ray* on either Kubernetes **or** Slurm.
+> **Audience**: You already know how to configure executors with NeMo-Run and want distributed *Ray* on either Kubernetes **or** Slurm **or** Lepton.
 >
 > **TL;DR**: `RayCluster` manages the _cluster_; `RayJob` submits a job with an ephemeral cluster. Everything else is syntactic sugar.
 
+## When to use Ray vs. standard execution
+
+| | Standard `run.Experiment` | Ray (`RayCluster` / `RayJob`) |
+|-|---------------------------|-------------------------------|
+| **Task style** | Python callable or shell script | Ray application (uses `ray.init()`, `@ray.remote`, etc.) |
+| **Distributed framework** | `torchrun`, fault-tolerant launcher | Ray core, Ray Train, RLlib, … |
+| **Cluster lifetime** | One job per experiment task | Persistent cluster reused across many jobs |
+| **Interactive use** | Not designed for it | Supported via `port_forward` + Ray dashboard |
+| **Best for** | PyTorch training, eval, batch jobs | RL workloads, hyper-param sweeps, actor-based pipelines |
+
+Use Ray when your workload is written against the Ray API. Use standard execution otherwise — it has lower overhead and simpler setup.
+
+## Prerequisites
+
+Install NeMo-Run (Ray extras are included by default):
+
+```bash
+pip install nemo_run
+```
+
+Backend-specific requirements:
+
+| Backend | What you need |
+|---------|---------------|
+| **KubeRay** | `kubectl` configured + KubeRay operator installed. See [KubeRayExecutor](executors/kuberay.md). |
+| **Slurm** | SSH access to a Slurm cluster with Pyxis. See [SlurmExecutor](executors/slurm.md). |
+| **Lepton** | Lepton CLI installed and authenticated. See [LeptonExecutor](executors/lepton.md). |
+
 ## RayCluster vs. RayJob – which one do I need?
 
 | Aspect | RayCluster (interactive) | RayJob (batch) |
@@ -42,6 +42,8 @@ classDiagram
 
 ## 2. KubeRay quick-start
 
+> **Executor reference**: [executors/kuberay.md](executors/kuberay.md) — prerequisites, `KubeRayExecutor` parameters, PVC mounts, custom scheduler.
+
 ```python
 from nemo_run.core.execution.kuberay import KubeRayExecutor, KubeRayWorkerGroup
 from nemo_run.run.ray.cluster import RayCluster
@@ -113,6 +113,8 @@ cluster.stop()
 
 ## 3. Slurm quick-start
 
+> **Executor reference**: [executors/slurm.md](executors/slurm.md) — SSH tunnel setup, `SlurmExecutor` parameters, job dependencies.
+
 ```python
 import os
 from pathlib import Path
@@ -185,8 +217,14 @@ cluster.stop()
 * `executor.packager = run.GitArchivePackager()` if you prefer packaging a git tree instead of rsync.
 * `cluster.port_forward()` opens an SSH tunnel from *your laptop* to the Ray dashboard running on the head node.
 
+```{note}
+`CustomJobDetails` (shown above) is an advanced pattern for redirecting Slurm stdout/stderr. It is not required for most workloads.
+```
+
 ## 4. DGX Cloud Lepton RayCluster quick-start
 
+> **Executor reference**: [executors/lepton.md](executors/lepton.md) — Lepton CLI setup, `LeptonExecutor` parameters, mounts, node reservations.
+
 ```python
 import os
 from pathlib import Path