Skip to content

feat: add DockerURL type to fix cache_docker_images_locally=false#851

Open
rutayan-nv wants to merge 2 commits intoNVIDIA:mainfrom
rutayan-nv:fix/docker-url-type-for-cache-disabled
Open

feat: add DockerURL type to fix cache_docker_images_locally=false#851
rutayan-nv wants to merge 2 commits intoNVIDIA:mainfrom
rutayan-nv:fix/docker-url-type-for-cache-disabled

Conversation

@rutayan-nv
Copy link

@rutayan-nv rutayan-nv commented Mar 26, 2026

Summary

This PR fixes container path resolution when cache_docker_images_locally=false is set in the system TOML configuration.

Problem:
When cache_docker_images_locally=false, the DockerImage.installed_path property returns the raw Docker URL (e.g., nvcr.io/nvidian/nemo:26.04.rc2). Downstream code in slurm_command_gen_strategy.py then calls Path(url).absolute() on this URL, which incorrectly converts it to a bogus local filesystem path like /current/working/dir/nvcr.io/nvidian/nemo:26.04.rc2. Pyxis then fails with "No such file or directory" because this path doesn't exist.

Solution:

  • Introduce a DockerURL(str) type in _core/types.py to semantically distinguish Docker registry URLs from filesystem paths
  • Update DockerImage.installed_path to return DockerURL when the image is not cached locally
  • Update slurm_command_gen_strategy.py to check isinstance(installed, Path) before calling .absolute(), passing DockerURL values directly to pyxis

This allows pyxis to correctly receive the Docker URL and pull the image from the registry at runtime.

Related Issue: #850

Test Plan

Testing Environment:

  • EOS cluster with Slurm scheduler
  • Container: nvcr.io/nvidian/nemo:26.04.rc2
  • System config: cache_docker_images_locally not set (defaults to false)

Steps:

  1. Ran unit tests locally:

    uv run python -m pytest tests/test_cache_docker_images_locally_false.py -v

    Output: 6 passed

  2. Deployed to EOS and ran Megatron-Bridge workload:

    cloudaix run --system-config eos.toml --test-scenario qwen3_235b_...toml
  3. Verified sbatch script contains correct container image:

    --container-image nvcr.io/nvidian/nemo:26.04.rc2
    

    (URL passed directly, not mangled into a local path)

  4. Verified pyxis successfully imported the image:

    pyxis: imported docker image: nvcr.io/nvidian/nemo:26.04.rc2
    
  5. Training started successfully:

    Starting training loop at iteration 0
    [2026-03-25 18:32:28] iteration 1/10 | lm loss: 1.275077E+01 | elapsed: 83974ms
    [2026-03-25 18:32:53] iteration 2/10 | lm loss: 1.275050E+01 | elapsed: 24361ms
    

Additional Notes

  • The DockerURL type is a simple str subclass, maintaining full backward compatibility with string operations
  • The type check uses isinstance(installed, Path) which is more robust than string pattern matching
  • This fix only affects the Megatron-Bridge workload's command generation; other workloads may need similar updates if they use docker_image.installed_path
  • Future consideration: The related issue Design issue: [[git_repos]] mount_as creates fragile version mismatch #850 discusses broader design concerns with [[git_repos]] and mount_as creating version mismatches

When cache_docker_images_locally=false, DockerImage.installed_path
was returning the raw URL string which downstream code passed through
Path().absolute(), creating bogus paths like "/cwd/nvcr.io/...".

This fix:
- Adds DockerURL(str) type in _core/types.py to distinguish URLs from paths
- Updates installed_path property to return DockerURL when not cached
- Updates slurm_command_gen_strategy.py to check isinstance(Path) vs DockerURL
  and only call .absolute() on actual Path objects

Now pyxis correctly receives the URL and can pull from registry.

Fixes: cache_docker_images_locally=false causing container not found errors
Made-with: Cursor
Tests verify:
- installed_path returns DockerURL when _installed_path is not set
- installed_path returns Path when _installed_path is set
- DockerURL is a str subclass for compatibility
- isinstance() can distinguish DockerURL from Path
- Container path resolution logic handles both types correctly
- cache_filename generation works correctly

Made-with: Cursor
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 26, 2026

📝 Walkthrough

Walkthrough

The changes introduce a new DockerURL type—a string subclass—to distinguish Docker registry URLs from filesystem paths. The DockerImage.installed_path property now returns Union[DockerURL, Path] instead of Union[str, Path], wrapping uncached URLs with the new type. Container path resolution logic is updated to handle both types appropriately via isinstance checks.

Changes

Cohort / File(s) Summary
Type System Enhancement
src/cloudai/_core/types.py
New module defining DockerURL as a str subclass to represent Docker registry URLs and enable runtime type distinction from filesystem paths.
Core Logic Updates
src/cloudai/_core/installables.py, src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py
Modified DockerImage.installed_path property signature and return type; updated _installed_container_path() to handle both Path and DockerURL types via isinstance checks instead of coercing all inputs through Path.absolute().
Comprehensive Testing
tests/test_cache_docker_images_locally_false.py
New test suite validating DockerURL behavior, installed_path return types under cached and uncached scenarios, container path resolution logic, and cache filename generation.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A URL wrapped in type, a path standing tall,
isinstance checks now distinguish them all,
No more string confusion, just clarity bright—
Docker knows registry, disk knows its right!
A rabbit hops forward through types now refined.

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the main change: introducing a DockerURL type to fix the cache_docker_images_locally=false configuration issue.
Description check ✅ Passed The description is well-structured, providing problem context, solution details, test results, and deployment validation that directly relate to the changeset.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@rutayan-nv rutayan-nv changed the title Fix/docker url type for cache disabled feat: add DockerURL type to fix cache_docker_images_locally=false Mar 26, 2026
@rutayan-nv rutayan-nv marked this pull request as ready for review March 26, 2026 02:09
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/cloudai/_core/installables.py (1)

78-95: ⚠️ Potential issue | 🟠 Major

Tighten installed_path storage to preserve the new DockerURL | Path contract.

Line 90 can still return a plain str because _installed_path/setter accept str. That breaks the runtime type distinction this PR introduces.

🔧 Proposed fix
 `@dataclass`
 class DockerImage(Installable):
@@
-    _installed_path: Optional[Union[str, Path]] = field(default=None, repr=False)
+    _installed_path: Optional[DockerURL | Path] = field(default=None, repr=False)
@@
-    def installed_path(self) -> Union[DockerURL, Path]:
+    def installed_path(self) -> DockerURL | Path:
@@
     `@installed_path.setter`
-    def installed_path(self, value: Union[str, Path]) -> None:
-        self._installed_path = value
+    def installed_path(self, value: str | DockerURL | Path) -> None:
+        if isinstance(value, Path):
+            self._installed_path = value
+        elif isinstance(value, DockerURL):
+            self._installed_path = value
+        elif value.startswith("/") or value.startswith("."):
+            self._installed_path = Path(value)
+        else:
+            self._installed_path = DockerURL(value)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cloudai/_core/installables.py` around lines 78 - 95, The setter for
installed_path must ensure self._installed_path is always either a Path or
DockerURL (not a plain str); update the installed_path.setter to accept
str|Path|DockerURL but convert any incoming str to a Path (e.g. value =
Path(value)) before storing so _installed_path never holds a raw str, and update
the setter type hints accordingly; keep the installed_path getter logic as-is
(it already handles Path vs DockerURL).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/test_cache_docker_images_locally_false.py`:
- Around line 43-49: Remove the explicit assertion message strings from the
pytest asserts checking img._installed_path, isinstance(result, DockerURL) and
equality of result == docker_url (and the similar asserts around lines 62-63);
keep the assertions themselves (e.g., assert img._installed_path is None, assert
isinstance(result, DockerURL), assert result == docker_url) so pytest can use
its built-in introspection rather than custom message text.
- Around line 90-112: The test currently defines a local helper
resolve_container_path which duplicates production logic and fails to protect
src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py from
regressions; update the test to exercise the real resolver instead of the local
stub by either (A) importing and calling the production resolver used in
slurm_command_gen_strategy.py (remove the local resolve_container_path and call
that resolver directly) or (B) refactoring the resolver into a shared function
(e.g., a new util exported from slurm_command_gen_strategy or a new module) and
have the test import and assert on that shared function; ensure the test still
covers both DockerURL and Path cases (use DockerURL(...) and Path(...)) so it
fails if the production resolver regresses.

---

Outside diff comments:
In `@src/cloudai/_core/installables.py`:
- Around line 78-95: The setter for installed_path must ensure
self._installed_path is always either a Path or DockerURL (not a plain str);
update the installed_path.setter to accept str|Path|DockerURL but convert any
incoming str to a Path (e.g. value = Path(value)) before storing so
_installed_path never holds a raw str, and update the setter type hints
accordingly; keep the installed_path getter logic as-is (it already handles Path
vs DockerURL).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7f4bc4d1-89f7-41e1-ac3d-0586d5bc2904

📥 Commits

Reviewing files that changed from the base of the PR and between 96f185e and cecc3ae.

📒 Files selected for processing (4)
  • src/cloudai/_core/installables.py
  • src/cloudai/_core/types.py
  • src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py
  • tests/test_cache_docker_images_locally_false.py

Comment on lines +43 to +49
assert img._installed_path is None, "Precondition: _installed_path should be None"

result = img.installed_path

# Returns DockerURL type (not plain str)
assert isinstance(result, DockerURL), f"Expected DockerURL, got {type(result)}"
assert result == docker_url
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Drop explicit assertion messages in these pytest asserts.

Pytest already renders clear actual/expected introspection, so these custom messages add noise.

🧹 Proposed cleanup
-        assert img._installed_path is None, "Precondition: _installed_path should be None"
+        assert img._installed_path is None
@@
-        assert isinstance(result, DockerURL), f"Expected DockerURL, got {type(result)}"
+        assert isinstance(result, DockerURL)
@@
-        assert isinstance(result, Path), f"Expected Path, got {type(result)}"
+        assert isinstance(result, Path)
Based on learnings: In pytest, rely on assertion introspection for failure details and avoid including explicit values in the assertion message.

Also applies to: 62-63

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_cache_docker_images_locally_false.py` around lines 43 - 49, Remove
the explicit assertion message strings from the pytest asserts checking
img._installed_path, isinstance(result, DockerURL) and equality of result ==
docker_url (and the similar asserts around lines 62-63); keep the assertions
themselves (e.g., assert img._installed_path is None, assert isinstance(result,
DockerURL), assert result == docker_url) so pytest can use its built-in
introspection rather than custom message text.

Comment on lines +90 to +112
def test_container_path_resolution_logic(self):
"""
Test the correct container path resolution logic.

This is what slurm_command_gen_strategy.py should do:
- Path: call .absolute() and convert to str
- DockerURL: pass directly as str
"""
def resolve_container_path(installed) -> str:
if isinstance(installed, Path):
return str(installed.absolute())
# DockerURL - pass directly to pyxis
return str(installed)

# DockerURL case
url_result = resolve_container_path(DockerURL("nvcr.io/nvidian/nemo:26.04.rc2"))
assert url_result == "nvcr.io/nvidian/nemo:26.04.rc2"
assert not url_result.startswith("/") # Not mangled into a local path

# Path case
path_result = resolve_container_path(Path("/install/image.sqsh"))
assert path_result == "/install/image.sqsh"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

This test validates a local reimplementation, not the production path resolver.

resolve_container_path() at Line 98 duplicates intended behavior, so this test can stay green while src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py regresses.

Please exercise the real command-generation path (or extract the resolver into a shared function and test that directly) so this test guards the actual integration behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_cache_docker_images_locally_false.py` around lines 90 - 112, The
test currently defines a local helper resolve_container_path which duplicates
production logic and fails to protect
src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py from
regressions; update the test to exercise the real resolver instead of the local
stub by either (A) importing and calling the production resolver used in
slurm_command_gen_strategy.py (remove the local resolve_container_path and call
that resolver directly) or (B) refactoring the resolver into a shared function
(e.g., a new util exported from slurm_command_gen_strategy or a new module) and
have the test import and assert on that shared function; ensure the test still
covers both DockerURL and Path cases (use DockerURL(...) and Path(...)) so it
fails if the production resolver regresses.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant