@umfranci
Collaborator

  • The verify_gpu_adapter_count test validates GPU counts by comparing outputs from lsvmbus, lspci, and nvidia-smi commands. However, it relies on a hardcoded list of GPU models and their device IDs to identify GPUs in the lsvmbus output.
  • This hardcoded approach fails on new GPU models: the code must be updated manually each time new GPU hardware is released. That delays testing, adds maintenance overhead, and raises the test's failure rate.
  • The aim here is to detect GPUs dynamically, so that new models are identified without manual intervention, while staying backward compatible with the existing GPU detection logic.
  • Suggested Fix:
    • Primary detection: Continue using the existing hardcoded GPU list for known models
    • Fallback mechanism: When no matches are found in the hardcoded list:
      • Group VMBus devices by their last segment (device ID suffix)
      • Identify GPU device groups where all entries are marked as "PCI Express pass-through"
      • Validate that the group's device count matches the GPU count reported by nvidia-smi
    • Direct counting: Add a new function that gets the GPU count directly from nvidia-smi command output, removing the need to maintain a hardcoded GPU model list for counting
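
The fallback and direct-counting steps above could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `VmbusDevice`, `count_gpus_in_nvidia_smi_list`, and `detect_gpu_group` are hypothetical names, the device IDs in the usage example are made up, and it assumes the GPU count is read from `nvidia-smi -L` style output (one `GPU <n>: ...` line per GPU), which the description does not specify.

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

# Device name that lsvmbus reports for passed-through PCI devices (from the PR text).
PASSTHROUGH = "PCI Express pass-through"


@dataclass(frozen=True)
class VmbusDevice:
    """Hypothetical parsed lsvmbus entry: a device ID plus its device name."""
    device_id: str
    name: str


def count_gpus_in_nvidia_smi_list(output: str) -> int:
    """Count GPUs in `nvidia-smi -L` style output (assumed: one 'GPU <n>:' line each)."""
    return sum(
        1 for line in output.splitlines() if re.match(r"^GPU \d+:", line.strip())
    )


def detect_gpu_group(
    devices: List[VmbusDevice], expected_count: int
) -> List[VmbusDevice]:
    """Fallback detection: return the VMBus device group that looks like the GPUs.

    Devices are grouped by the last '-' separated segment of their device ID.
    A group qualifies only if every entry is a PCI Express pass-through device
    and the group size equals the count reported by nvidia-smi.
    """
    if expected_count <= 0:
        return []

    groups: Dict[str, List[VmbusDevice]] = defaultdict(list)
    for device in devices:
        suffix = device.device_id.strip("{}").split("-")[-1]
        groups[suffix].append(device)

    for members in groups.values():
        all_passthrough = all(m.name == PASSTHROUGH for m in members)
        if all_passthrough and len(members) == expected_count:
            return members
    return []
```

For example, with made-up device IDs, two pass-through entries sharing the suffix `aaaa` are selected when nvidia-smi reports two GPUs:

```python
smi_output = "GPU 0: Tesla T4 (UUID: GPU-xxxx)\nGPU 1: Tesla T4 (UUID: GPU-yyyy)\n"
devices = [
    VmbusDevice("{00000001-0000-aaaa}", PASSTHROUGH),
    VmbusDevice("{00000002-0000-aaaa}", PASSTHROUGH),
    VmbusDevice("{00000003-0000-bbbb}", "Synthetic network adapter"),
]
gpus = detect_gpu_group(devices, count_gpus_in_nvidia_smi_list(smi_output))
```

Requiring the group size to match the nvidia-smi count keeps other pass-through devices (for example NVMe or accelerated networking) from being mistaken for GPUs, at the cost of returning nothing when the counts disagree.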
