Conversation

@0405ysj commented Nov 5, 2025

Context: b/455678690

With the configuration below, Cloud Orchestrator with DockerIM can utilize an NVIDIA GPU.

[InstanceManager.Docker]
GpuManufacturer = "nvidia"
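
For illustration, here is a minimal sketch of what this setting could translate to when DockerIM creates a container through the Docker Go SDK. The package name, function name, and wiring are assumptions rather than code from this PR; the flag equivalence (--env NVIDIA_DRIVER_CAPABILITIES=all --gpus all --runtime nvidia) is taken from the discussion below.

// Hypothetical illustration only; not code from this PR.
package instances

import "github.com/docker/docker/api/types/container"

// applyNvidiaGpu mirrors `--env NVIDIA_DRIVER_CAPABILITIES=all --gpus all --runtime nvidia`.
func applyNvidiaGpu(cfg *container.Config, hostCfg *container.HostConfig) {
	cfg.Env = append(cfg.Env, "NVIDIA_DRIVER_CAPABILITIES=all")
	hostCfg.Runtime = "nvidia"
	hostCfg.DeviceRequests = append(hostCfg.DeviceRequests, container.DeviceRequest{
		Count:        -1, // -1 requests all GPUs, like `--gpus all`
		Capabilities: [][]string{{"gpu"}},
	})
}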

@0405ysj marked this pull request as ready for review November 5, 2025 05:52
@0405ysj requested review from ikicha and k311093 and removed request for Databean, adelva1984 and rmuthiah November 5, 2025 05:53
-	im = instances.NewDockerInstanceManager(config.InstanceManager, cli)
+	im, err = instances.NewDockerInstanceManager(config.InstanceManager, cli)
+	if err != nil {
+		log.Fatal("Failed to create Docker Instance Manager: ", err)
+	}
Member

This function should return (instances.Manager, error) if it can fail like this.
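
A hedged sketch of what that could look like; the helper name, parameter types, and surrounding imports are assumptions for illustration, not the actual code in this file:

// Illustrative only: propagate the error instead of calling log.Fatal here.
func newDockerInstanceManager(conf Config, cli *client.Client) (instances.Manager, error) {
	im, err := instances.NewDockerInstanceManager(conf.InstanceManager, cli)
	if err != nil {
		return nil, fmt.Errorf("failed to create Docker Instance Manager: %w", err)
	}
	return im, nil
}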

Collaborator Author

I used log.Fatal for consistency, as other places in this file do. Would you like me to refactor it here?

@ser-io (Member) left a comment

For consistency with the GCE counterpart, and for flexibility, make the use of accelerators (GPUs) part of the requests first. You can add configuration support later if needed.

Use #319 as a reference.


type DockerIMConfig struct {
DockerImageName string
GpuManufacturer string
Member

Avoid terms like "manufacturer" that are not part of the Docker documentation (https://docs.docker.com/desktop/features/gpu). Try to use names similar to those in the documentation that Docker users are already familiar with.

Collaborator Author

My suggestion is equivalent to --env NVIDIA_DRIVER_CAPABILITIES=all --gpus all --runtime nvidia in terms of executing docker run, and I don't think there is a proper name in the Docker documentation to represent my purpose.

--env and --runtime look fine to expose in the CO configuration, but --gpus in docker run directly specifies GPU allocation. I think --gpus shouldn't be exposed in the CO configuration, since DockerIM runs multiple docker instances and we will probably want CO to handle GPU allocation later. I don't want to design how DockerIM allocates GPUs right now, as the details of NVIDIA GPUs are pretty complicated to account for.

So, I need to define a new name to convey whether DockerIM will utilize a GPU or not. Retrieving that information by parsing --env or --runtime would be appropriate. Alternatively, a boolean configuration such as UseNvidiaGpu looks valid to me (see the sketch below). If the CO configuration for GPU is expressive enough to set --env or --runtime, I don't think we need to define new configurations in advance, which could cause compatibility issues in the future.
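
For example, the boolean alternative mentioned above might look like this in the config struct (hypothetical; not what this PR currently implements):

type DockerIMConfig struct {
	DockerImageName string
	UseNvidiaGpu    bool // if true, run instances with --env NVIDIA_DRIVER_CAPABILITIES=all --gpus all --runtime nvidia
}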

@ser-io (Member) Nov 6, 2025

Let's solve #494 (comment) first.

@0405ysj requested a review from ser-io November 6, 2025 02:27

type DockerIMConfig struct {
DockerImageName string
GpuManufacturer string
@ser-io (Member) Nov 6, 2025

Let's solve #494 (comment) first.


type DockerIMConfig struct {
DockerImageName string
GpuManufacturer string
@ser-io (Member) Nov 6, 2025

The ability to create docker instances using a GPU should be part of the CO public API, not hidden as a CO configuration. Please explain why you want to hide this ability from end users.

For reference, the ability to add accelerators is part of the public API for GCE hosts; see #319. Also, the gcloud and docker CLIs follow the same principle. Going the opposite way here should be properly justified.
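
Purely for illustration (these type and field names are hypothetical, not the actual CO API), a request-level GPU knob analogous to the GCE accelerator support in #319 could look like:

// Hypothetical request shape; not the actual Cloud Orchestrator API.
type CreateHostRequestSketch struct {
	// ... existing host creation fields ...
	GpuConfig *GpuConfigSketch `json:"gpu_config,omitempty"`
}

type GpuConfigSketch struct {
	Count int    `json:"count"` // number of GPUs to attach
	Type  string `json:"type"`  // e.g. "nvidia"
}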

Collaborator Author

I cannot say it's under the same principle, because of how docker run --runtime works. The valid values of the --runtime flag depend on how dockerd is configured. At the very least, this isn't appropriate to expose in the cvdr CLI; it should stay in the CO configuration.

$ sudo nvidia-ctk runtime configure --runtime=docker # Modifies /etc/docker/daemon.json by adding a new runtime.
$ cat /etc/docker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}
$ sudo systemctl restart docker
# Then users can execute `docker run --runtime nvidia [args]`

On the other hand, I think it's a bit complicated to reach agreement from here... I'll propose a design around GPU utilization when I have time, perhaps with a GPU allocation mechanism too.

Member

Sounds good.

@0405ysj marked this pull request as draft November 7, 2025 05:07