feat: onboard NVIDIA GPU support for ACL #8112
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -21,12 +21,25 @@ downloadSysextFromVersion() { | |||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||
| matchLocalSysext() { | ||||||||||||||||||||||||||||||||||||||||||||||||||
| local seName=$1 desiredVer=$2 seArch=$3 | ||||||||||||||||||||||||||||||||||||||||||||||||||
| printf "%s\n" "/opt/${seName}/downloads/${seName}-v${desiredVer}"[.~-]*"-${seArch}.raw" | sort -V | tail -n1 | ||||||||||||||||||||||||||||||||||||||||||||||||||
| local downloadDir="/opt/${seName}/downloads" | ||||||||||||||||||||||||||||||||||||||||||||||||||
| # Try arch-specific versioned filename first (kubelet-style: name-vVER.X-arch.raw) | ||||||||||||||||||||||||||||||||||||||||||||||||||
| local match | ||||||||||||||||||||||||||||||||||||||||||||||||||
| match=$(find "${downloadDir}" -maxdepth 2 -name "${seName}-v${desiredVer}*-${seArch}.raw" -type f 2>/dev/null | sort -V | tail -n1) | ||||||||||||||||||||||||||||||||||||||||||||||||||
| if [ -f "${match}" ]; then | ||||||||||||||||||||||||||||||||||||||||||||||||||
| echo "${match}" | ||||||||||||||||||||||||||||||||||||||||||||||||||
| return | ||||||||||||||||||||||||||||||||||||||||||||||||||
| fi | ||||||||||||||||||||||||||||||||||||||||||||||||||
| # Fallback: GPU sysexts are downloaded as simple name.raw (e.g. nvidia-driver-vgpu.raw). | ||||||||||||||||||||||||||||||||||||||||||||||||||
| # MCR artifacts may place files in an arch subdirectory (e.g. amd64/name.raw), | ||||||||||||||||||||||||||||||||||||||||||||||||||
| # so search up to 2 levels deep. | ||||||||||||||||||||||||||||||||||||||||||||||||||
| match=$(find "${downloadDir}" -maxdepth 2 -name "${seName}.raw" -type f 2>/dev/null | head -n1) | ||||||||||||||||||||||||||||||||||||||||||||||||||
|
Comment on lines
+33
to
+35
|
||||||||||||||||||||||||||||||||||||||||||||||||||
| # MCR artifacts may place files in an arch subdirectory (e.g. amd64/name.raw), | |
| # so search up to 2 levels deep. | |
| match=$(find "${downloadDir}" -maxdepth 2 -name "${seName}.raw" -type f 2>/dev/null | head -n1) | |
| # Prefer an arch-specific subdirectory (${downloadDir}/${seArch}) when present, | |
| # then fall back to an arch-neutral file directly under ${downloadDir}. In both | |
| # cases, pick the highest version deterministically. | |
| match=$(find "${downloadDir}/${seArch}" -maxdepth 1 -name "${seName}.raw" -type f 2>/dev/null | sort -V | tail -n1) | |
| if [ -f "${match}" ]; then | |
| echo "${match}" | |
| return | |
| fi | |
| match=$(find "${downloadDir}" -maxdepth 1 -name "${seName}.raw" -type f 2>/dev/null | sort -V | tail -n1) |
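The lookup order in the suggestion above (arch-specific subdirectory first, then an arch-neutral file directly under the download directory) can be sketched as a standalone function. This is a minimal illustration, not the PR's code: find_gpu_sysext is a hypothetical name, and the directory layout is assumed from the comments in the diff.

```bash
# Sketch only: find_gpu_sysext is a hypothetical helper, not part of the PR.
# Layout assumed from the diff comments: <dir>/<arch>/<name>.raw or <dir>/<name>.raw
find_gpu_sysext() {
  local download_dir="$1" name="$2" arch="$3" match
  # Prefer the arch-specific subdirectory when it holds the file.
  match=$(find "${download_dir}/${arch}" -maxdepth 1 -name "${name}.raw" -type f 2>/dev/null | sort -V | tail -n1)
  if [ -z "$match" ]; then
    # Fall back to an arch-neutral file directly under the download directory.
    match=$(find "${download_dir}" -maxdepth 1 -name "${name}.raw" -type f 2>/dev/null | sort -V | tail -n1)
  fi
  # Report failure explicitly instead of printing an empty path.
  [ -f "$match" ] || return 1
  printf '%s\n' "$match"
}
```

Returning a non-zero status when nothing is found is a choice made for this sketch; the function in the diff prints whatever it matched and leaves the check to the caller.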
Copilot
AI
Mar 19, 2026
The updated regex allows both arch-specific tags and “exact version” tags, but selecting with a single sort -V | tail -n1 pass is ambiguous when both forms exist, because which tag wins depends on the tag set and on sort order. To make selection deterministic, consider a two-pass lookup: try the arch-specific pattern first, and fall back to the exact-version tag only if that yields no match.
| # Match either arch-specific tags (v{ver}[.~-]*-azlinux3-{arch}) or exact version tags ({ver}) | |
| retrycmd_silent 120 5 20 oras repo tags --registry-config "${ORAS_REGISTRY_CONFIG_FILE}" "${seURL}" | grep -Ex "(v${desiredVer//./\\.}[.~-].*-azlinux3-${seArch}|${desiredVer//./\\.})" | sort -V | tail -n1 | |
| test ${PIPESTATUS[0]} -eq 0 | |
| local tags archPattern exactPattern match | |
| # Fetch all tags once; retrycmd_silent handles retries and logging. | |
| tags=$(retrycmd_silent 120 5 20 oras repo tags --registry-config "${ORAS_REGISTRY_CONFIG_FILE}" "${seURL}") | |
| if [ $? -ne 0 ]; then | |
| # Propagate failure from oras/registry access. | |
| return 1 | |
| fi | |
| # First pass: prefer arch-specific tags (v{ver}[.~-]*-azlinux3-{arch}). | |
| archPattern="^v${desiredVer//./\\.}[.~-].*-azlinux3-${seArch}$" | |
| match=$(printf '%s\n' "${tags}" | grep -E "${archPattern}" | sort -V | tail -n1) | |
| if [ -n "${match}" ]; then | |
| echo "${match}" | |
| return 0 | |
| fi | |
| # Second pass: fall back to exact-version tags ({ver}) if no arch-specific tag exists. | |
| exactPattern="^${desiredVer//./\\.}$" | |
| match=$(printf '%s\n' "${tags}" | grep -E "${exactPattern}" | sort -V | tail -n1) | |
| echo "${match}" |
henryli001 marked this conversation as resolved.
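The two-pass selection discussed in this thread can be exercised without a registry by passing the tag list in as text. The sketch below assumes exactly that: select_tag is a hypothetical helper, the tags are supplied by the caller rather than fetched with oras repo tags, and the tag shapes follow the patterns quoted in the suggestion.

```bash
# Sketch only: select_tag is a hypothetical helper. The tag list is passed in
# as text instead of being fetched from a registry.
select_tag() {
  local tags="$1" desired_ver="$2" arch="$3"
  local escaped_ver match
  # Escape dots so that "1.33" cannot match "1x33" in the regexes below.
  escaped_ver=$(printf '%s' "$desired_ver" | sed 's/\./\\./g')

  # First pass: arch-specific tags such as v1.33.2-azlinux3-amd64.
  match=$(printf '%s\n' "$tags" | grep -E "^v${escaped_ver}[.~-].*-azlinux3-${arch}\$" | sort -V | tail -n1)
  if [ -n "$match" ]; then
    printf '%s\n' "$match"
    return 0
  fi

  # Second pass: exact-version tag, used only when the first pass found nothing.
  match=$(printf '%s\n' "$tags" | grep -Ex "${escaped_ver}" | tail -n1)
  [ -n "$match" ] || return 1
  printf '%s\n' "$match"
}

tags=$(printf '%s\n' v1.33.1-azlinux3-amd64 v1.33.2-azlinux3-amd64 1.33)
select_tag "$tags" 1.33 amd64   # prints v1.33.2-azlinux3-amd64
```

Because the passes run in sequence, an exact-version tag can never outrank an arch-specific one, whatever order sort -V would give the combined list. Unlike the suggestion, this sketch returns 1 when neither pass matches.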
Copilot
AI
Mar 19, 2026
should_use_nvidia_open_drivers returns 2 specifically for “unable to determine VM SKU”, but this path exits with ERR_MISSING_CUDA_PACKAGE, which is misleading and can cause incorrect failure categorization/telemetry. Prefer propagating the function’s error (or add a dedicated error code like ERR_GPU_DRIVER_SELECTION_FAIL) and emit an error message that matches the underlying cause (e.g., IMDS SKU lookup failure).
| echo "Failed to determine GPU driver type" | |
| exit $ERR_MISSING_CUDA_PACKAGE | |
| echo "Failed to determine GPU driver type for this VM: unable to determine VM SKU (should_use_nvidia_open_drivers returned ${driver_ret})" | |
| exit "${driver_ret}" |
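The three-way return convention (0 for open drivers, 1 for proprietary, 2 on error) and the dedicated error code proposed in this comment can be sketched together. Several things here are assumptions: the SKU is passed as an argument instead of being read through get_compute_sku, uses_open_drivers and select_driver are hypothetical names, and the value given to ERR_GPU_DRIVER_SELECTION_FAIL is a placeholder, since the review only proposes the name.

```bash
# Sketch only: names are hypothetical and the error-code value is a placeholder.
ERR_GPU_DRIVER_SELECTION_FAIL=250

# Returns 0 for open drivers, 1 for proprietary drivers, 2 when the SKU is unknown.
uses_open_drivers() {
  local lower
  [ -n "$1" ] || return 2
  lower=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$lower" in
    # T4 and V100 families listed in the diff use proprietary drivers.
    *t4_v3*|*nd40rs_v2*|*nd40s_v3*|standard_nc*s_v3*) return 1 ;;
  esac
  return 0
}

select_driver() {
  local ret=0
  # Capture the status without tripping `set -e` in the caller.
  uses_open_drivers "$1" || ret=$?
  case "$ret" in
    0) echo "open" ;;
    1) echo "proprietary" ;;
    *) echo "Failed to determine GPU driver type: VM SKU unknown" >&2
       return "$ERR_GPU_DRIVER_SELECTION_FAIL" ;;
  esac
}

select_driver Standard_NC4as_T4_v3   # prints proprietary
select_driver Standard_ND96asr_v4    # prints open
```

Keeping the error branch separate from the "proprietary" branch is the point of the review comment: a status of 2 means the lookup failed, not that a CUDA package is missing.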
| Original file line number | Diff line number | Diff line change | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -851,4 +851,65 @@ datasource: | |||||||||||||
| EOF | ||||||||||||||
| } | ||||||||||||||
|
|
||||||||||||||
| # ==== GPU driver functions ==== | ||||||||||||||
| # Shared between Azure Linux (Mariner) and ACL distro install scripts. | ||||||||||||||
| # These functions are only invoked on GPU-enabled VM SKUs during provisioning; | ||||||||||||||
| # they are safe to define on all distros (no execution at source time). | ||||||||||||||
|
|
||||||||||||||
| should_use_nvidia_open_drivers() { | ||||||||||||||
| # Checks if the VM SKU should use NVIDIA open drivers (vs proprietary drivers). | ||||||||||||||
| # Legacy GPUs (T4, V100) use NVIDIA proprietary drivers; A100+ use NVIDIA open drivers. | ||||||||||||||
| # Returns: 0 (true) for open drivers, 1 (false) for proprietary drivers, 2 on error | ||||||||||||||
| local vm_sku | ||||||||||||||
| vm_sku=$(get_compute_sku) | ||||||||||||||
| if [ -z "$vm_sku" ]; then | ||||||||||||||
| echo "Error: Unable to determine VM SKU, cannot select GPU driver" >&2 | ||||||||||||||
| return 2 | ||||||||||||||
| fi | ||||||||||||||
| local lower="${vm_sku,,}" | ||||||||||||||
|
|
||||||||||||||
| # T4 GPUs (NC*_T4_v3 family) use proprietary drivers | ||||||||||||||
| # V100 GPUs: NDv2 (nd40rs_v2), NDv3 (nd40s_v3), NCsv3 (nc*s_v3) use proprietary drivers | ||||||||||||||
| case "$lower" in | ||||||||||||||
| *t4_v3*) | ||||||||||||||
| return 1 | ||||||||||||||
| ;; | ||||||||||||||
| *nd40rs_v2*) | ||||||||||||||
| return 1 | ||||||||||||||
| ;; | ||||||||||||||
| *nd40s_v3*) | ||||||||||||||
| return 1 | ||||||||||||||
| ;; | ||||||||||||||
| standard_nc*s_v3*) | ||||||||||||||
| return 1 | ||||||||||||||
| ;; | ||||||||||||||
| esac | ||||||||||||||
|
|
||||||||||||||
| # All other GPU SKUs (A100+) use open drivers | ||||||||||||||
| return 0 | ||||||||||||||
| } | ||||||||||||||
|
|
||||||||||||||
| enableNvidiaPersistenceMode() { | ||||||||||||||
djsly marked this conversation as resolved.
|
||||||||||||||
| PERSISTENCED_SERVICE_FILE_PATH="/etc/systemd/system/nvidia-persistenced.service" | ||||||||||||||
| touch ${PERSISTENCED_SERVICE_FILE_PATH} | ||||||||||||||
| cat << EOF > ${PERSISTENCED_SERVICE_FILE_PATH} | ||||||||||||||
|
Comment on lines
+893
to
+895
|
||||||||||||||
| PERSISTENCED_SERVICE_FILE_PATH="/etc/systemd/system/nvidia-persistenced.service" | |
| touch ${PERSISTENCED_SERVICE_FILE_PATH} | |
| cat << EOF > ${PERSISTENCED_SERVICE_FILE_PATH} | |
| local PERSISTENCED_SERVICE_FILE_PATH="/etc/systemd/system/nvidia-persistenced.service" | |
| touch "${PERSISTENCED_SERVICE_FILE_PATH}" | |
| cat << EOF > "${PERSISTENCED_SERVICE_FILE_PATH}" |
Copilot
AI
Mar 20, 2026
enableNvidiaPersistenceMode calls systemctl enable/restart and runs exit 1 on failure. Now that this function is shared and used for ACL, exiting with a generic code bypasses the repo’s standardized error codes and skips the retry/timeout wrappers (systemctlEnableAndStart, systemctl_*). Consider using the helper wrappers and returning a specific error code (e.g. ERR_SYSTEMCTL_START_FAIL) so failures are actionable in CSE telemetry.
| systemctl enable nvidia-persistenced.service || exit 1 | |
| systemctl restart nvidia-persistenced.service || exit 1 | |
| if ! systemctlEnableAndStart nvidia-persistenced.service; then | |
| return $ERR_SYSTEMCTL_START_FAIL | |
| fi |
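The difference between exiting from a shared function and returning a named error code can be shown with a stand-in wrapper. Everything below is an assumption made for illustration: enable_and_start only imitates the role of the repo's systemctlEnableAndStart (whose real signature and retry policy are not shown on this page), the SYSTEMCTL override exists solely so the sketch runs without systemd, and the numeric value of ERR_SYSTEMCTL_START_FAIL is a placeholder.

```bash
# Sketch only: enable_and_start is a hypothetical stand-in for the repo's
# wrapper, and the error-code value is a placeholder.
ERR_SYSTEMCTL_START_FAIL=4

enable_and_start() {
  local unit="$1" retries="$2" i=0
  while [ "$i" -lt "$retries" ]; do
    # SYSTEMCTL can be overridden (for example with `true` or `false`) for dry runs.
    if "${SYSTEMCTL:-systemctl}" enable "$unit" && "${SYSTEMCTL:-systemctl}" restart "$unit"; then
      return 0
    fi
    i=$((i + 1))
  done
  return 1
}

enable_persistence_mode() {
  # Return a specific code so the caller decides whether to abort provisioning.
  if ! enable_and_start nvidia-persistenced.service 3; then
    return "$ERR_SYSTEMCTL_START_FAIL"
  fi
}
```

A real wrapper would normally wait between attempts and apply a timeout; both are left out to keep the sketch short.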