Introduce configuration to utilize nvidia GPU on dockerIM #494
Conversation
-    im = instances.NewDockerInstanceManager(config.InstanceManager, cli)
+    im, err = instances.NewDockerInstanceManager(config.InstanceManager, cli)
+    if err != nil {
+        log.Fatal("Failed to create Docker Instance Manager: ", err)
This function should return (instances.Manager, error) if it can fail like this.
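(For illustration, a minimal sketch of the suggested shape; the function name, the Config type, and the error wrapping below are placeholders, not code from this repository.)
// Hypothetical sketch: propagate the error to the caller instead of
// terminating inside this path with log.Fatal.
func newInstanceManager(config Config, cli *client.Client) (instances.Manager, error) {
    im, err := instances.NewDockerInstanceManager(config.InstanceManager, cli)
    if err != nil {
        return nil, fmt.Errorf("failed to create Docker instance manager: %w", err)
    }
    return im, nil
}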
I used log.Fatal to keep consistency, as other places in this file do. Would you like me to refactor it here?
For consistency with the GCE counterpart and for flexibility, make the use of accelerators (GPUs) part of the requests first. You can add configuration support later if needed.
Use #319 as a reference.
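(Purely hypothetical illustration of the "part of the request" idea; the actual public API fields are defined by the CO API and #319, and none of the names below come from them.)
// Hypothetical request shape, for illustration only.
type CreateCVDRequestSketch struct {
    // ... existing request fields ...
    Accelerators []AcceleratorSketch `json:"accelerators,omitempty"`
}

// AcceleratorSketch carries the GPU ask as data in the request
// rather than as a server-side configuration knob.
type AcceleratorSketch struct {
    Type  string `json:"type"`  // e.g. "nvidia"
    Count int    `json:"count"` // number of GPUs requested
}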
type DockerIMConfig struct {
    DockerImageName string
    GpuManufacturer string
Avoid terms like "manufacturer" that are not part of the Docker documentation (https://docs.docker.com/desktop/features/gpu). Try to use names similar to those used in the documentation that Docker users are already familiar with.
My suggestion is equivalent to passing --env NVIDIA_DRIVER_CAPABILITIES=all --gpus all --runtime nvidia to docker run, and I don't think the Docker documentation has a proper name for what I'm trying to express.
--env and --runtime look fine to expose in the CO configuration, but --gpus in docker run directly specifies GPU allocation. I think --gpus shouldn't be exposed in the CO configuration, since DockerIM runs multiple Docker instances and we will probably want CO to handle GPU allocation later. I don't want to design how DockerIM allocates GPUs right now, as the details of nvidia GPUs make that fairly complicated.
So I need to define a new name to convey whether DockerIM will utilize a GPU or not. Retrieving that information by parsing --env or --runtime would be appropriate, or a boolean configuration such as UseNvidiaGpu also looks valid to me. If the CO configuration for GPUs is representative enough to set --env or --runtime, I don't think we need to define new configuration options in advance, which could cause compatibility issues in the far future.
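(For illustration, a minimal sketch of what that docker run invocation corresponds to through the Docker Go SDK, which is the kind of call DockerIM makes; the package, function, and image parameter are placeholders, not code from this PR.)
package gpusketch

import (
    "context"

    "github.com/docker/docker/api/types/container"
    "github.com/docker/docker/client"
)

// createNvidiaGpuContainer is a hypothetical sketch: roughly the SDK
// equivalent of
//   docker run --env NVIDIA_DRIVER_CAPABILITIES=all --gpus all --runtime nvidia <image>
func createNvidiaGpuContainer(ctx context.Context, cli *client.Client, image string) (string, error) {
    cfg := &container.Config{
        Image: image,
        Env:   []string{"NVIDIA_DRIVER_CAPABILITIES=all"}, // --env NVIDIA_DRIVER_CAPABILITIES=all
    }
    hostCfg := &container.HostConfig{
        Runtime: "nvidia", // --runtime nvidia
        Resources: container.Resources{
            DeviceRequests: []container.DeviceRequest{
                // --gpus all: request all GPUs exposing the "gpu" capability.
                {Count: -1, Capabilities: [][]string{{"gpu"}}},
            },
        },
    }
    resp, err := cli.ContainerCreate(ctx, cfg, hostCfg, nil, nil, "")
    if err != nil {
        return "", err
    }
    return resp.ID, nil
}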
Let's solve #494 (comment) first.
type DockerIMConfig struct {
    DockerImageName string
    GpuManufacturer string
Let's solve #494 (comment) first.
type DockerIMConfig struct {
    DockerImageName string
    GpuManufacturer string
The ability to create docker instances using a GPU should be part of the CO public API, not hidden as a CO configuration. Please explain why you want to hide this ability from end users.
For reference, the ability to add accelerators is part of the public API for GCE hosts; see #319. The gcloud and docker CLIs also follow the same principle. Going the opposite way here should be properly justified.
I can't say it follows the same principle, because of how docker run --runtime works. The valid values of the --runtime flag depend on how dockerd is configured. At the very least, this isn't appropriate to expose in the cvdr CLI; it should live in the CO configuration.
$ sudo nvidia-ctk runtime configure --runtime=docker # Modifies /etc/docker/daemon.json by adding a new runtime.
$ cat /etc/docker/daemon.json
{
"runtimes": {
"nvidia": {
"args": [],
"path": "nvidia-container-runtime"
}
}
}
$ sudo systemctl restart docker
# Then users can execute `docker run --runtime nvidia [args]`
On the other hand, I think it will be a bit complicated to reach agreement from here... I'll propose a design around GPU utilization when I have time, perhaps including a GPU allocation mechanism too.
Sounds good.
Context: b/455678690
By configuring Cloud Orchestrator as below, Cloud Orchestrator with dockerIM can utilize an nvidia GPU.
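(The example configuration itself is not included in this excerpt. As a purely hypothetical illustration based on the DockerIMConfig struct added in this PR, the relevant part might be populated along these lines; the field values are placeholders.)
// Hypothetical illustration only; not the actual Cloud Orchestrator
// configuration file or its defaults.
cfg := DockerIMConfig{
    DockerImageName: "cuttlefish-orchestration", // placeholder image name
    GpuManufacturer: "nvidia",                   // opt in to the nvidia runtime path
}
_ = cfg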