[CI] Add CUDA 13 nightly containers #31822
base: main
Conversation
Signed-off-by: Sahithi Chigurupati <chigurupati.sahithi@gmail.com>
Code Review
This pull request introduces support for building and publishing nightly Docker containers for CUDA 13. This is achieved by adding new steps to the Buildkite release pipeline for building CUDA 13 images for x86 and arm64, creating a multi-arch manifest, and publishing them to DockerHub. The cleanup-nightly-builds.sh script has also been updated to accept a tag prefix, making it reusable for cleaning up different sets of nightly images.
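The prefix-parameterized cleanup can be pictured with a small sketch. This is hypothetical, not the actual `cleanup-nightly-builds.sh`: the real script talks to the DockerHub API, while here the tag list is read from stdin so the selection logic can be exercised offline. The 14-build retention count comes from the pipeline comment.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a prefix-parameterized cleanup helper.
# The real cleanup-nightly-builds.sh queries the DockerHub API; here,
# tags arrive on stdin (newest first) so the logic is testable.
set -euo pipefail

# tags_to_delete PREFIX KEEP
# Reads tags on stdin (newest first) and prints the tags beyond the
# newest KEEP that match PREFIX, i.e. the deletion candidates.
tags_to_delete() {
  local prefix="$1" keep="$2"
  grep "^${prefix}" | tail -n +"$((keep + 1))"
}

# Example: 16 cuda13 nightly tags with keep=14 leaves the 2 oldest.
for i in $(seq 16 -1 1); do echo "cuda13-nightly-build$i"; done \
  | tags_to_delete "cuda13-nightly-" 14
# prints: cuda13-nightly-build2
#         cuda13-nightly-build1
```

Calling it as `tags_to_delete "nightly-" 14` versus `tags_to_delete "cuda13-nightly-" 14` is what lets one script serve both image families.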
The changes are logical and follow the existing structure of the pipeline. However, I have identified two high-severity issues in the pipeline configuration file:
- An invalid CUDA compute capability is specified in the `torch_cuda_arch_list` for the arm64 build.
- There is significant code duplication in the pipeline steps, which harms maintainability. I've suggested using YAML anchors to refactor this.
Please see the detailed comments for suggestions on how to address these points.
```yaml
      queue: arm64_cpu_queue_postmerge
    commands:
      - "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
      - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=13.0.1 --build-arg BUILD_BASE_IMAGE=nvidia/cuda:13.0.1-devel-ubuntu22.04 --build-arg FLASHINFER_AOT_COMPILE=true --build-arg torch_cuda_arch_list='8.7 8.9 9.0 10.0+PTX 12.0' --build-arg INSTALL_KV_CONNECTORS=true --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-cuda13-$(uname -m) --target vllm-openai --progress plain -f docker/Dockerfile ."
```
The `torch_cuda_arch_list` contains `12.0`, which is not a valid CUDA compute capability. The latest defined architecture is 9.0 for Hopper. While `10.0+PTX` may be included for forward compatibility with upcoming architectures like Blackwell, `12.0` is likely a mistake. At best it will be ignored by the build system, but it should be removed to avoid confusion and potential issues.
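If an entry does need to be dropped, a tiny shell helper illustrates the edit; `drop_arch` is a hypothetical name and not part of the PR or the build system:

```shell
# Hypothetical helper (not in the PR): remove one entry from a
# space-separated torch_cuda_arch_list value.
drop_arch() {  # drop_arch "LIST" "ARCH"
  local out="" a
  for a in $1; do
    # Keep every entry except the one being removed.
    [ "$a" = "$2" ] || out="${out:+$out }$a"
  done
  printf '%s\n' "$out"
}

drop_arch '8.7 8.9 9.0 10.0+PTX 12.0' '12.0'
# prints: 8.7 8.9 9.0 10.0+PTX
```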
Suggested change:

```yaml
      - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=13.0.1 --build-arg BUILD_BASE_IMAGE=nvidia/cuda:13.0.1-devel-ubuntu22.04 --build-arg FLASHINFER_AOT_COMPILE=true --build-arg torch_cuda_arch_list='8.7 8.9 9.0 10.0+PTX' --build-arg INSTALL_KV_CONNECTORS=true --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-cuda13-$(uname -m) --target vllm-openai --progress plain -f docker/Dockerfile ."
```

```yaml
  - label: "Build and publish nightly CUDA 13.0 multi-arch image to DockerHub"
    depends_on:
      - create-multi-arch-manifest-cuda13
    if: build.env("NIGHTLY") == "1"
    agents:
      queue: cpu_queue_postmerge
    commands:
      - "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
      - "docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-cuda13-x86_64"
      - "docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-cuda13-aarch64"
      - "docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-cuda13-x86_64 vllm/vllm-openai:cuda13-nightly-x86_64"
      - "docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-cuda13-aarch64 vllm/vllm-openai:cuda13-nightly-aarch64"
      - "docker push vllm/vllm-openai:cuda13-nightly-x86_64"
      - "docker push vllm/vllm-openai:cuda13-nightly-aarch64"
      - "docker manifest create vllm/vllm-openai:cuda13-nightly vllm/vllm-openai:cuda13-nightly-x86_64 vllm/vllm-openai:cuda13-nightly-aarch64 --amend"
      - "docker manifest create vllm/vllm-openai:cuda13-nightly-$BUILDKITE_COMMIT vllm/vllm-openai:cuda13-nightly-x86_64 vllm/vllm-openai:cuda13-nightly-aarch64 --amend"
      - "docker manifest push vllm/vllm-openai:cuda13-nightly"
      - "docker manifest push vllm/vllm-openai:cuda13-nightly-$BUILDKITE_COMMIT"
      # Clean up old CUDA 13.0 nightly builds (keep only last 14)
      - "bash .buildkite/scripts/cleanup-nightly-builds.sh cuda13-nightly-"
    plugins:
      - docker-login#v3.0.0:
          username: vllmbot
```
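The per-arch tags above all follow one naming scheme, `vllm/vllm-openai:<prefix>-<arch>`. A hypothetical helper (illustrative only, not in the PR) makes that convention explicit:

```shell
# Hypothetical helper (not in the PR): compose the per-arch DockerHub
# tag used by the publish step.
nightly_tag() {  # nightly_tag PREFIX ARCH
  printf 'vllm/vllm-openai:%s-%s\n' "$1" "$2"
}

nightly_tag cuda13-nightly x86_64
# prints: vllm/vllm-openai:cuda13-nightly-x86_64
```

Centralizing the scheme like this is one way to keep the tag, push, and manifest commands from drifting apart across the CUDA 12.9 and 13.0 steps.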
This new step to publish nightly CUDA 13.0 images is almost a complete duplicate of the existing step for CUDA 12.9 images. This level of duplication makes the pipeline configuration difficult to maintain and prone to errors, as changes need to be manually synchronized across multiple blocks.
To improve maintainability, I strongly recommend refactoring this using YAML anchors and aliases. You can define a template for the common parts of the job and then reuse it for each CUDA version, only overriding the specific parts like dependencies and tag prefixes.
Here is a conceptual example:
```yaml
.publish_nightly_template: &publish_nightly_template
  if: build.env("NIGHTLY") == "1"
  agents:
    queue: cpu_queue_postmerge
  plugins:
    - docker-login#v3.0.0:
        username: vllmbot
        password-env: DOCKERHUB_TOKEN
  # ... other common properties

- label: "Build and publish nightly CUDA 12.9 ..."
  <<: *publish_nightly_template
  depends_on:
    - create-multi-arch-manifest
  commands:
    # ... commands with version-specific tags
    - "bash .buildkite/scripts/cleanup-nightly-builds.sh nightly-"

- label: "Build and publish nightly CUDA 13.0 ..."
  <<: *publish_nightly_template
  depends_on:
    - create-multi-arch-manifest-cuda13
  commands:
    # ... commands with version-specific tags
    - "bash .buildkite/scripts/cleanup-nightly-builds.sh cuda13-nightly-"
```

Even the commands list could be further parameterized using variables to reduce duplication even more. Adopting this pattern will make the pipeline much cleaner and easier to manage.
Purpose
Add nightly CUDA 13 container builds to the release pipeline (`release-pipeline.yaml`).

Essential Elements of an Effective PR Description Checklist
- Update `supported_models.md` and `examples` for a new model.