feat: add DockerURL type to fix cache_docker_images_locally=false#851
feat: add DockerURL type to fix cache_docker_images_locally=false#851rutayan-nv wants to merge 2 commits intoNVIDIA:mainfrom
Conversation
When cache_docker_images_locally=false, DockerImage.installed_path was returning the raw URL string which downstream code passed through Path().absolute(), creating bogus paths like "/cwd/nvcr.io/...". This fix: - Adds DockerURL(str) type in _core/types.py to distinguish URLs from paths - Updates installed_path property to return DockerURL when not cached - Updates slurm_command_gen_strategy.py to check isinstance(Path) vs DockerURL and only call .absolute() on actual Path objects Now pyxis correctly receives the URL and can pull from registry. Fixes: cache_docker_images_locally=false causing container not found errors Made-with: Cursor
Tests verify: - installed_path returns DockerURL when _installed_path is not set - installed_path returns Path when _installed_path is set - DockerURL is a str subclass for compatibility - isinstance() can distinguish DockerURL from Path - Container path resolution logic handles both types correctly - cache_filename generation works correctly Made-with: Cursor
📝 WalkthroughWalkthroughThe changes introduce a new Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 2✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/cloudai/_core/installables.py (1)
78-95:⚠️ Potential issue | 🟠 MajorTighten
installed_pathstorage to preserve the newDockerURL | Pathcontract.Line 90 can still return a plain
strbecause_installed_path/setter acceptstr. That breaks the runtime type distinction this PR introduces.🔧 Proposed fix
`@dataclass` class DockerImage(Installable): @@ - _installed_path: Optional[Union[str, Path]] = field(default=None, repr=False) + _installed_path: Optional[DockerURL | Path] = field(default=None, repr=False) @@ - def installed_path(self) -> Union[DockerURL, Path]: + def installed_path(self) -> DockerURL | Path: @@ `@installed_path.setter` - def installed_path(self, value: Union[str, Path]) -> None: - self._installed_path = value + def installed_path(self, value: str | DockerURL | Path) -> None: + if isinstance(value, Path): + self._installed_path = value + elif isinstance(value, DockerURL): + self._installed_path = value + elif value.startswith("/") or value.startswith("."): + self._installed_path = Path(value) + else: + self._installed_path = DockerURL(value)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/cloudai/_core/installables.py` around lines 78 - 95, The setter for installed_path must ensure self._installed_path is always either a Path or DockerURL (not a plain str); update the installed_path.setter to accept str|Path|DockerURL but convert any incoming str to a Path (e.g. value = Path(value)) before storing so _installed_path never holds a raw str, and update the setter type hints accordingly; keep the installed_path getter logic as-is (it already handles Path vs DockerURL).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/test_cache_docker_images_locally_false.py`:
- Around line 43-49: Remove the explicit assertion message strings from the
pytest asserts checking img._installed_path, isinstance(result, DockerURL) and
equality of result == docker_url (and the similar asserts around lines 62-63);
keep the assertions themselves (e.g., assert img._installed_path is None, assert
isinstance(result, DockerURL), assert result == docker_url) so pytest can use
its built-in introspection rather than custom message text.
- Around line 90-112: The test currently defines a local helper
resolve_container_path which duplicates production logic and fails to protect
src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py from
regressions; update the test to exercise the real resolver instead of the local
stub by either (A) importing and calling the production resolver used in
slurm_command_gen_strategy.py (remove the local resolve_container_path and call
that resolver directly) or (B) refactoring the resolver into a shared function
(e.g., a new util exported from slurm_command_gen_strategy or a new module) and
have the test import and assert on that shared function; ensure the test still
covers both DockerURL and Path cases (use DockerURL(...) and Path(...)) so it
fails if the production resolver regresses.
---
Outside diff comments:
In `@src/cloudai/_core/installables.py`:
- Around line 78-95: The setter for installed_path must ensure
self._installed_path is always either a Path or DockerURL (not a plain str);
update the installed_path.setter to accept str|Path|DockerURL but convert any
incoming str to a Path (e.g. value = Path(value)) before storing so
_installed_path never holds a raw str, and update the setter type hints
accordingly; keep the installed_path getter logic as-is (it already handles Path
vs DockerURL).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 7f4bc4d1-89f7-41e1-ac3d-0586d5bc2904
📒 Files selected for processing (4)
src/cloudai/_core/installables.pysrc/cloudai/_core/types.pysrc/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.pytests/test_cache_docker_images_locally_false.py
| assert img._installed_path is None, "Precondition: _installed_path should be None" | ||
|
|
||
| result = img.installed_path | ||
|
|
||
| # Returns DockerURL type (not plain str) | ||
| assert isinstance(result, DockerURL), f"Expected DockerURL, got {type(result)}" | ||
| assert result == docker_url |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Drop explicit assertion messages in these pytest asserts.
Pytest already renders clear actual/expected introspection, so these custom messages add noise.
🧹 Proposed cleanup
- assert img._installed_path is None, "Precondition: _installed_path should be None"
+ assert img._installed_path is None
@@
- assert isinstance(result, DockerURL), f"Expected DockerURL, got {type(result)}"
+ assert isinstance(result, DockerURL)
@@
- assert isinstance(result, Path), f"Expected Path, got {type(result)}"
+ assert isinstance(result, Path)Also applies to: 62-63
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/test_cache_docker_images_locally_false.py` around lines 43 - 49, Remove
the explicit assertion message strings from the pytest asserts checking
img._installed_path, isinstance(result, DockerURL) and equality of result ==
docker_url (and the similar asserts around lines 62-63); keep the assertions
themselves (e.g., assert img._installed_path is None, assert isinstance(result,
DockerURL), assert result == docker_url) so pytest can use its built-in
introspection rather than custom message text.
| def test_container_path_resolution_logic(self): | ||
| """ | ||
| Test the correct container path resolution logic. | ||
|
|
||
| This is what slurm_command_gen_strategy.py should do: | ||
| - Path: call .absolute() and convert to str | ||
| - DockerURL: pass directly as str | ||
| """ | ||
| def resolve_container_path(installed) -> str: | ||
| if isinstance(installed, Path): | ||
| return str(installed.absolute()) | ||
| # DockerURL - pass directly to pyxis | ||
| return str(installed) | ||
|
|
||
| # DockerURL case | ||
| url_result = resolve_container_path(DockerURL("nvcr.io/nvidian/nemo:26.04.rc2")) | ||
| assert url_result == "nvcr.io/nvidian/nemo:26.04.rc2" | ||
| assert not url_result.startswith("/") # Not mangled into a local path | ||
|
|
||
| # Path case | ||
| path_result = resolve_container_path(Path("/install/image.sqsh")) | ||
| assert path_result == "/install/image.sqsh" | ||
|
|
There was a problem hiding this comment.
This test validates a local reimplementation, not the production path resolver.
resolve_container_path() at Line 98 duplicates intended behavior, so this test can stay green while src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py regresses.
Please exercise the real command-generation path (or extract the resolver into a shared function and test that directly) so this test guards the actual integration behavior.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/test_cache_docker_images_locally_false.py` around lines 90 - 112, The
test currently defines a local helper resolve_container_path which duplicates
production logic and fails to protect
src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py from
regressions; update the test to exercise the real resolver instead of the local
stub by either (A) importing and calling the production resolver used in
slurm_command_gen_strategy.py (remove the local resolve_container_path and call
that resolver directly) or (B) refactoring the resolver into a shared function
(e.g., a new util exported from slurm_command_gen_strategy or a new module) and
have the test import and assert on that shared function; ensure the test still
covers both DockerURL and Path cases (use DockerURL(...) and Path(...)) so it
fails if the production resolver regresses.
Summary
This PR fixes container path resolution when
cache_docker_images_locally=falseis set in the system TOML configuration.Problem:
When
cache_docker_images_locally=false, theDockerImage.installed_pathproperty returns the raw Docker URL (e.g.,nvcr.io/nvidian/nemo:26.04.rc2). Downstream code inslurm_command_gen_strategy.pythen callsPath(url).absolute()on this URL, which incorrectly converts it to a bogus local filesystem path like/current/working/dir/nvcr.io/nvidian/nemo:26.04.rc2. Pyxis then fails with "No such file or directory" because this path doesn't exist.Solution:
DockerURL(str)type in_core/types.pyto semantically distinguish Docker registry URLs from filesystem pathsDockerImage.installed_pathto returnDockerURLwhen the image is not cached locallyslurm_command_gen_strategy.pyto checkisinstance(installed, Path)before calling.absolute(), passingDockerURLvalues directly to pyxisThis allows pyxis to correctly receive the Docker URL and pull the image from the registry at runtime.
Related Issue: #850
Test Plan
Testing Environment:
nvcr.io/nvidian/nemo:26.04.rc2cache_docker_images_locallynot set (defaults tofalse)Steps:
Ran unit tests locally:
Output: 6 passed
Deployed to EOS and ran Megatron-Bridge workload:
Verified sbatch script contains correct container image:
(URL passed directly, not mangled into a local path)
Verified pyxis successfully imported the image:
Training started successfully:
Additional Notes
DockerURLtype is a simplestrsubclass, maintaining full backward compatibility with string operationsisinstance(installed, Path)which is more robust than string pattern matchingdocker_image.installed_path[[git_repos]]andmount_ascreating version mismatches