-
Notifications
You must be signed in to change notification settings - Fork 387
Add Colocated Python support for Pathways #1353
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
2faf540 to
6eac867
Compare
muyangyuapple
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also can you please add it to the lws pathways job? https://github.pie.apple.com/foundation-models/axlearn/blob/3103ca9e3967e73347d89325ab5887f0a0efb90a/axlearn/cloud/gcp/runners/__init__.py#L63
This is used in long running inference service
| ] | ||
| env: | ||
| - "DOCKER_BUILDKIT=1" | ||
| - "DOCKER_BUILDKIT=1\"""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need to branching the logic in the bundler.py file?
I think we just add some extra dependencies for colocated python. We can use a env_var in the Dockerfile to control that like this: https://github.com/apple/axlearn/blob/main/Dockerfile#L104
Which can be passed in via something like --bundler_spec=INSTALL_PATHWAYS_JAXLIB=true \
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To enable colocated python, an extra image for the colocated python sidecar running on each worker pod needs to be built (so 2 images get built). This implementation works the same for the user - to enable it you just need to pass in --bundler_spec=enable_colocated_python=True.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can head pod and sidecar on worker share the same image?
If there are two images to build, we should define another bundler called "sidecar_bundler" here. Adding logics to bundler that is used only by a subset of jobs will make bundler hard to maintain in the long run.
|
|
||
| def __init__(self, cfg: Config, *, bundler: Bundler): | ||
| super().__init__(cfg) | ||
| self._enable_colocated_python = getattr(bundler.config, "enable_colocated_python", False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
enable_colocated_python should be an attribute in PathwaysReplicatedJob
Also logically, when you use PathwaysColocatedPythonPlugin, you implies that colocated python is already enabled? So this class should not have a attribute call self._enable_colocated_python?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PathwaysColocatedPythonPlugin is for job types which support colocated python but doesn't require it to be enabled; the properties pathways_server_image and pathways_proxy_image return different values based on whether colocated python is enabled or disabled. Later on, it could be worth moving more pathways-specific logic to PathwaysColocatedPythonPlugin and making it just PathwaysPlugin - currently PathwaysReplicatedJob and PathwaysLeaderWorkerTemplate have a lot of duplicated logic.
| if self._tpu_type not in USER_FACING_NAME_TO_SYSTEM_CHARACTERISTICS: | ||
| raise NotImplementedError(f"Missing system characteristics for {self._tpu_type}") | ||
|
|
||
| self._colocated_python = cfg.colocated_python.instantiate(bundler=bundler) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should instantiate it only when colocated python is enabled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's still used when colocated python is disabled to get pathways_server_image and pathways_proxy_image.
It's already implemented in |
6eac867 to
f674e10
Compare
Co-authored-by: lkolluru05 <lkolluru@google.com>
f674e10 to
276cd87
Compare
Disabled by default. Can be enabled by passing
--bundler_spec=enable_colocated_python=Truewhen launching a Pathways job or LWS job. Both artifact registry and CloudBuild bundlers are supported.This PR is a refactored version of #1350