Skip to content

Conversation

@rohitc33
Copy link

@rohitc33 rohitc33 commented Nov 4, 2025

Disabled by default. Can be enabled by passing --bundler_spec=enable_colocated_python=True when launching a Pathways job or LWS job. Both artifact registry and CloudBuild bundlers are supported.

This PR is a refactored version of #1350

@rohitc33 rohitc33 requested review from a team as code owners November 4, 2025 19:19
@rohitc33 rohitc33 force-pushed the rohit-colocated-2 branch 2 times, most recently from 2faf540 to 6eac867 Compare November 6, 2025 22:27
Copy link
Contributor

@muyangyuapple muyangyuapple left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also can you please add it to the lws pathways job? https://github.pie.apple.com/foundation-models/axlearn/blob/3103ca9e3967e73347d89325ab5887f0a0efb90a/axlearn/cloud/gcp/runners/__init__.py#L63

This is used in long running inference service

]
env:
- "DOCKER_BUILDKIT=1"
- "DOCKER_BUILDKIT=1\""""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need to branching the logic in the bundler.py file?

I think we just add some extra dependencies for colocated python. We can use a env_var in the Dockerfile to control that like this: https://github.com/apple/axlearn/blob/main/Dockerfile#L104

Which can be passed in via something like --bundler_spec=INSTALL_PATHWAYS_JAXLIB=true \

Copy link
Author

@rohitc33 rohitc33 Nov 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To enable colocated python, an extra image for the colocated python sidecar running on each worker pod needs to be built (so 2 images get built). This implementation works the same for the user - to enable it you just need to pass in --bundler_spec=enable_colocated_python=True.

Copy link
Contributor

@muyangyuapple muyangyuapple Nov 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can head pod and sidecar on worker share the same image?

If there are two images to build, we should define another bundler called "sidecar_bundler" here. Adding logics to bundler that is used only by a subset of jobs will make bundler hard to maintain in the long run.


def __init__(self, cfg: Config, *, bundler: Bundler):
super().__init__(cfg)
self._enable_colocated_python = getattr(bundler.config, "enable_colocated_python", False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enable_colocated_python should be an attribute in PathwaysReplicatedJob

Also logically, when you use PathwaysColocatedPythonPlugin, you implies that colocated python is already enabled? So this class should not have a attribute call self._enable_colocated_python?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PathwaysColocatedPythonPlugin is for job types which support colocated python but doesn't require it to be enabled; the properties pathways_server_image and pathways_proxy_image return different values based on whether colocated python is enabled or disabled. Later on, it could be worth moving more pathways-specific logic to PathwaysColocatedPythonPlugin and making it just PathwaysPlugin - currently PathwaysReplicatedJob and PathwaysLeaderWorkerTemplate have a lot of duplicated logic.

if self._tpu_type not in USER_FACING_NAME_TO_SYSTEM_CHARACTERISTICS:
raise NotImplementedError(f"Missing system characteristics for {self._tpu_type}")

self._colocated_python = cfg.colocated_python.instantiate(bundler=bundler)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should instantiate it only when colocated python is enabled?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's still used when colocated python is disabled to get pathways_server_image and pathways_proxy_image.

@rohitc33
Copy link
Author

Also can you please add it to the lws pathways job? https://github.pie.apple.com/foundation-models/axlearn/blob/3103ca9e3967e73347d89325ab5887f0a0efb90a/axlearn/cloud/gcp/runners/__init__.py#L63

This is used in long running inference service

It's already implemented in PathwaysLeaderWorkerTemplate.

Co-authored-by: lkolluru05 <lkolluru@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants