Conversation
@@ -0,0 +1,5 @@
# Checkbox Provider - NPU
Is there anything that should be done when creating a new provider? CI-wise, packaging-wise?
Yes, indeed. It will at least require adding the provider to tox-checkbox.yaml and creating a tox file for this provider.
Nevertheless, I discussed this with Max this morning and, since these tests are fairly simple and they are all behind a new manifest entry, we could add them directly to the base provider.
I will finish the review to make sure that's the case, but that way we won't need to create a new provider.
I have been told by @farshidtz that, to run the coverage and the different Python version tests, the provider needs to be added to the tox-checkbox workflow.
However, I am not sure if there is more to do for a new provider.
estimated_duration: 2s
command: check_accel_permissions.py
imports: from com.canonical.plainbox import manifest
requires: manifest.has_npu == 'True'
I think that new manifest entries must now be submitted to the certification team to be added to C3's feature set.
C3 gets them from https://github.com/canonical/blueprints so you could also make a PR there, I think.
@fernando79513 and @tomli380576, can you please review this PR?
pseudocc
left a comment
Just a passerby reviewer, see inline comments.
@@ -0,0 +1,5 @@
# Checkbox Provider - NPU

This provider includes tests for devices with an NPU. As of right now, it is intended only for Intel NPUs. The tests only run as long as the manifest entry `has_npu` is set to `true`.
Should check this instead of defining has_npu:

modinfo -Falias intel_vpu
pci:v00008086d0000FD3Esv*sd*bc*sc*i*
pci:v00008086d0000B03Esv*sd*bc*sc*i*
pci:v00008086d0000643Esv*sd*bc*sc*i*
pci:v00008086d0000AD1Dsv*sd*bc*sc*i*
pci:v00008086d00007D1Dsv*sd*bc*sc*i*
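For illustration, detection along those lines could look roughly like the sketch below. The `modinfo -Falias intel_vpu` command and the sysfs modalias files are standard; the script structure and the `npu_detected` resource field are assumptions, not code from this PR.

#!/usr/bin/env python3
"""Sketch: compare the modalias patterns declared by intel_vpu with the
modaliases of the PCI devices actually present."""
import fnmatch
import subprocess
from pathlib import Path


def intel_vpu_alias_patterns():
    # modinfo -Falias prints one glob-style modalias pattern per line
    out = subprocess.check_output(
        ["modinfo", "-Falias", "intel_vpu"], universal_newlines=True
    )
    return [line.strip() for line in out.splitlines() if line.strip()]


def present_pci_modaliases():
    # every PCI device exposes its modalias under sysfs
    return [
        p.read_text().strip()
        for p in Path("/sys/bus/pci/devices").glob("*/modalias")
    ]


def main():
    patterns = intel_vpu_alias_patterns()
    devices = present_pci_modaliases()
    detected = any(
        fnmatch.fnmatch(dev, pat) for dev in devices for pat in patterns
    )
    # field name is an assumption; a resource job just prints key: value pairs
    print("npu_detected: {}".format("true" if detected else "false"))


if __name__ == "__main__":
    main()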
Do you mean there should be no manifest entry at all?
I have now added a job to check the modinfo output; is that what you had in mind?
I was thinking about reading the modinfo output in a resource job, but now I think it may not be the ideal way.
Say we have:
- a 6.17 kernel and a Panther Lake CPU (the modaliases for Panther Lake were only added in 6.18).

Then all NPU tests would be skipped, which is not what we want. Let's just keep the manifest implementation; it's good.
Yes, I think in these cases it's better to go with the manifest. If we know (via the manifest) whether a device should be tested for an NPU, then we can easily detect an error in case the NPU is not detected correctly.
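To illustrate that point, a job gated on the manifest can turn a missing NPU into an explicit failure rather than a silent skip. A minimal sketch follows; /dev/accel/accel0 is the standard accel device node, but the actual check in this PR may differ.

#!/usr/bin/env python3
"""Sketch: with manifest.has_npu == 'True' this job always runs, so a
missing accel device becomes an explicit failure, not a silent skip."""
import os

ACCEL_NODE = "/dev/accel/accel0"  # standard accel node; an assumption here


def main():
    if not os.path.exists(ACCEL_NODE):
        raise SystemExit(
            "has_npu is set but {} does not exist: "
            "the NPU was not detected correctly".format(ACCEL_NODE)
        )
    print("NPU device node found: {}".format(ACCEL_NODE))


if __name__ == "__main__":
    main()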
Test scripts lgtm! Could you provide a bit of documentation on what the expectations are for NPU_UMD_TEST_CONFIG: where it's supposed to be placed, required permissions, what exactly the expected contents are (for example the tree output of a correct setup), etc.
One small question: is the driver snap expected to come preinstalled? If not, I think we should print an error somewhere or mention it in the manifest.
def main():
    config_path = os.environ.get("NPU_UMD_TEST_CONFIG")
I think NPU_UMD_TEST_CONFIG also needs to be an absolute path since it's passed to dirname in one of the jobs:

assert Path(config_path).is_absolute()
This file (along with the model file) might have to be placed inside the driver snap's current directory, or the umd tests will say "the file is bad" and trigger an apparmor deny message. For example, if I put the config file in $HOME, this happens:
[Tue Dec 23 10:47:03 2025] audit: type=1400 audit(1766458023.464:331): apparmor="DENIED" operation="open" class="file" profile="snap.intel-npu-driver.npu-umd-test" name="/home/ubuntu/basic.yaml" pid=31451 comm="npu-umd-test" requested_mask="r" denied_mask="r" fsuid=1000 ouid=1000
Currently this file is placed in the current directory by the script which installs the intel-npu-driver snap. The file is pretty much static, but we're planning to have it located directly inside the intel-npu-driver snap (since the format could change between versions of npu-umd-test, which is already distributed as a binary in the intel-npu-driver snap).
Does this mean the model files would be bundled with the snap in the future? If so, I think you can use the path of the bundled files as the default inside the test case and only use the environment variable as an override. This would also make it easier to run the test since we won't have to specify NPU_UMD_TEST_CONFIG every time.
The config file as well as the model files are now bundled with the intel-npu-driver snap on the latest/edge channel. The default config and model files are used if you specify --config=basic.yaml.
I have updated the code to make the NPU_UMD_TEST_CONFIG optional.
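Roughly, the resolution could look like the sketch below; the bundled default path is an assumption, the actual layout inside the intel-npu-driver snap may differ.

import os
from pathlib import Path

# Hypothetical location of the config bundled with the snap; the real
# path depends on how the intel-npu-driver snap ships it.
DEFAULT_CONFIG = Path("/snap/intel-npu-driver/current/basic.yaml")


def resolve_config_path():
    """Use NPU_UMD_TEST_CONFIG when set, otherwise fall back to the
    file bundled with the snap. The override must be absolute because
    one of the jobs passes it to dirname."""
    override = os.environ.get("NPU_UMD_TEST_CONFIG")
    if not override:
        return DEFAULT_CONFIG
    path = Path(override)
    if not path.is_absolute():
        raise SystemExit(
            "NPU_UMD_TEST_CONFIG must be an absolute path, got {}".format(path)
        )
    return path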
The snap is not pre-installed, but ideally the devices that do have the has_npu manifest entry should install the snap before running checkbox... Is there any way in checkbox to define this dependency and maybe even have the snap auto-installed by checkbox?
pseudocc
left a comment
Looks good to me now, thanks!
To make this case depend on whether the driver snap exists, add …
Codecov Report

❌ Patch coverage is …

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2198      +/-   ##
==========================================
+ Coverage   53.34%   55.83%   +2.49%
==========================================
  Files         399      422      +23
  Lines       42907    45417    +2510
  Branches     7945     8194     +249
==========================================
+ Hits        22887    25358    +2471
- Misses      19214    19287      +73
+ Partials      806      772      -34

Flags with carried forward coverage won't be shown.

View full report in Codecov by Sentry.
I see, I have actually avoided using the dependency exactly so that it would be an explicit failure if the snap is not present, but …

As a side note, I think it would be useful to have something like Ansible's …
fernando79513
left a comment
Sorry for taking so long to review it. Thanks a lot for all the effort.
I will continue the review on Monday, but I'm leaving some comments in the meantime.
raise SystemExit(
    "The loaded firmware does not match any version in the snap files."
)
Same comment as before. If the versions mismatch, it would be useful to log the loaded version and the versions in the snap to investigate the issue.
I have now updated it to include the loaded version and the versions found in the snap in the error message.
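Presumably something along these lines; the function and variable names here are illustrative, not the exact code in the PR.

def check_firmware_version(loaded_version, snap_versions):
    # Include both sides of the comparison so a mismatch can be
    # investigated from the job output alone.
    if loaded_version not in snap_versions:
        raise SystemExit(
            "The loaded firmware ({}) does not match any version in the "
            "snap files ({}).".format(loaded_version, ", ".join(snap_versions))
        )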
# This can happen with corrupted files or if 'strings' isn't installed
return None
return None
No need to return None in Python.
I would log the error here though, to know that it could not read the file.
Suggested change:
- # This can happen with corrupted files or if 'strings' isn't installed
- return None
- return None
+ print("Version pattern not found in file")
I have made this raise SystemExit now.
except (
    IOError,
    FileNotFoundError,
):
Nitpick: I just feel it's more readable this way.
Suggested change:
- except (
-     IOError,
-     FileNotFoundError,
- ):
+ except (IOError, FileNotFoundError):
try:
    # Find all sequences of one or more digits in the string
    numbers = re.findall(
        r"\d+",
        version_str,
    )
    if not numbers:
        raise ValueError("No version numbers found in the string.")

    # Convert the list of number strings to a tuple of integers
    return tuple(
        map(
            int,
            numbers,
        )
    )
except (
    ValueError,
    TypeError,
):
    return None
Nitpick: this may be an autoformatting config thing, but these 20 lines could be written in just 9, with a more readable structure (and more consistent with the rest of the checkbox codebase).
Maybe it's an issue with the trailing commas after each list/tuple item?
I saw you have already fixed it in some parts of the code.
Suggested change:
- try:
-     # Find all sequences of one or more digits in the string
-     numbers = re.findall(
-         r"\d+",
-         version_str,
-     )
-     if not numbers:
-         raise ValueError("No version numbers found in the string.")
-     # Convert the list of number strings to a tuple of integers
-     return tuple(
-         map(
-             int,
-             numbers,
-         )
-     )
- except (
-     ValueError,
-     TypeError,
- ):
-     return None
+ try:
+     # Find all sequences of one or more digits in the string
+     numbers = re.findall(r"\d+", version_str)
+     if not numbers:
+         raise ValueError("No version numbers found in the string.")
+     # Convert the list of number strings to a tuple of integers
+     return tuple(map(int, numbers))
+ except (ValueError, TypeError):
+     return None
I'm pretty sure this formatting comes from running black on it; I don't use any other autoformatter. It could be wrong comma placement though; I'll check if I can make it format in fewer lines.
Considering how many of these formattings there are, I wonder if I gave black the wrong line length at some point...
I think it's putting trailing commas by default. I also use black, but when I remove the trailing comma on lists/tuples, I get a more reasonable result.
Yeah, I also vaguely think the commas were added by black, but honestly I can't be sure anymore. I think it stemmed from the fact that black considers 80-82 characters still fine for -l 79, but then the flake8 check fails, so I tried passing smaller numbers to black to make the lines actually at most 79 characters long.
Anyway, I've reformatted everything I saw that could fit on fewer lines now 🙂
result = subprocess.run(
    [
        "journalctl",
        "--dmesg",
    ],
    capture_output=True,
    text=True,
    check=True,
    encoding="utf-8",
)
It should be okay to just use subprocess.check_output(). Also, I don't think "encoding=" is available in Python 3.5, but I don't think we are ever going to run these tests on core16.
Suggested change:
- result = subprocess.run(
-     [
-         "journalctl",
-         "--dmesg",
-     ],
-     capture_output=True,
-     text=True,
-     check=True,
-     encoding="utf-8",
- )
+ result = subprocess.check_output(
+     ["journalctl", "--dmesg"], universal_newlines=True, encoding="utf-8"
+ )
I have updated all of the calls to use check_output now. And I don't think the encoding parameter is necessary there anyway.
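So presumably the calls ended up looking roughly like this: text mode via universal_newlines only, which also works on older Python 3 versions. This is a sketch of the pattern described above, not the exact code in the PR.

import subprocess

dmesg = subprocess.check_output(
    ["journalctl", "--dmesg"], universal_newlines=True
)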
result = subprocess.run(
    [
        "strings",
        filepath,
    ],
    capture_output=True,
    text=True,
    check=True,
    encoding="utf-8",
)
Suggested change:
- result = subprocess.run(
-     [
-         "strings",
-         filepath,
-     ],
-     capture_output=True,
-     text=True,
-     check=True,
-     encoding="utf-8",
- )
+ result = subprocess.check_output(
+     ["strings", filepath], universal_newlines=True, encoding="utf-8"
+ )
This is now updated. 👍
Remove unreachable return

Co-authored-by: Fernando Bravo <39527354+fernando79513@users.noreply.github.com>
fernando79513
left a comment
I left a few more comments on the PR.
My main concern is with the "non-blocker" jobs. I am not sure under which conditions you expect them to fail, but we should not add them to the test plan if they are not expected to pass.
Apart from that, could you add some submissions from devices in the lab running the NPU tests with the new provider?
current_version = parse_version(current_version_str)
required_version = parse_version(args.required_version)
We have packaging.version available. If you do from packaging import version, you can use it to parse kernel versions, in a similar way to this script.
Suggested change:
- current_version = parse_version(current_version_str)
- required_version = parse_version(args.required_version)
+ current_version = version.Version(current_version_str.rsplit('-', 1)[0])
+ required_version = version.Version(args.required_version.rsplit('-', 1)[0])
It would be great to move this functionality to checkbox_support/helpers, but I don't think we are including the packaging module there.
We could have a simple kernel version parser there though:

def parse_kernel_version(version_str):
    version, flavor = version_str.rsplit("-", 1)
    return version, flavor
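For example, assuming kernel version strings like 6.8.0-49-generic, the comparison could be sketched as below; note that it strips everything after the first "-", a slight variation on the suggestion above.

from packaging import version


def kernel_at_least(current_str, required_str):
    # Drop the flavour suffix before parsing, since packaging.version
    # expects PEP 440-style version strings.
    current = version.Version(current_str.split("-", 1)[0])
    return current >= version.Version(required_str)


print(kernel_at_least("6.8.0-49-generic", "6.8"))  # True
print(kernel_at_least("6.5.0-14-generic", "6.8"))  # False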
known_failures = (
    subprocess.run(
        ["intel-npu-driver.known-failures"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    .stdout.strip()
    .splitlines()
)
This is a bit simpler:
Suggested change:
- known_failures = (
-     subprocess.run(
-         ["intel-npu-driver.known-failures"],
-         stdout=subprocess.PIPE,
-         stderr=subprocess.PIPE,
-         universal_newlines=True,
-     )
-     .stdout.strip()
-     .splitlines()
- )
+ known_failures = (
+     subprocess.check_output(
+         ["intel-npu-driver.known-failures"],
+         universal_newlines=True,
+     )
+     .strip()
+     .splitlines()
+ )
Thanks, I changed all of the subprocess.run calls to subprocess.check_output 👍
help=" 'blocker' prints only tests that aren't known failures,"
"'non-blocker' prints only known failures.",
I think the naming is confusing here. We don't expect any of the tests in checkbox to fail in normal circumstances, even if they are "non-blocker" tests.
If they only pass some of the time, just call them flaky. If they only pass on certain architectures, use another resource and filter them. The "blocker"/"non-blocker" category should be reserved only for the certification process.
They are known failures; the list is already filtered by CPU generation and running kernel version. However, the failures might get fixed by Intel in future driver versions, and that's why it would be useful to have the information about whether they passed or failed on a certain version.
If you just want to keep this information from run to run, you could just have an informational job.
You could have a job that runs all these tests that are expected to fail and logs their results, without the actual job failing. You should make this purpose clear in the tests.
Something like:
id: tracked_known_issues
category_id: intel_npu
flags: simple
_description: Keeps track of tests with known failures that may be fixed in the future.
estimated_duration: 2s
command: failed_tests_info
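A rough sketch of what such a failed_tests_info script could do follows. It assumes the snap's known-failures and npu-umd-test commands mentioned earlier in the thread and gtest's --gtest_filter option; it is illustrative only, not code from the PR.

#!/usr/bin/env python3
"""Run the tests that are known to fail and log their results without
failing the job itself (informational only)."""
import subprocess


def main():
    known_failures = subprocess.check_output(
        ["intel-npu-driver.known-failures"], universal_newlines=True
    ).strip().splitlines()

    for test in known_failures:
        # npu-umd-test is a gtest binary, so individual tests can be
        # selected with --gtest_filter (assumed invocation).
        proc = subprocess.run(
            ["intel-npu-driver.npu-umd-test",
             "--gtest_filter={}".format(test)]
        )
        status = "passed" if proc.returncode == 0 else "still failing"
        print("{}: {}".format(test, status))

    return 0  # never fail: this job only records information


if __name__ == "__main__":
    raise SystemExit(main())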
)

FIRMWARE_SEARCH_DIR = Path("/var/snap/intel-npu-driver/current/intel/vpu")
VERSION_PATTERN = re.compile(r"^(\d{8}\*|[A-Z][a-z]{2}\s+\d{1,2}\s+\d{4}\*).*")
This regex pattern is quite hard to read. If we can't avoid using it (in favor of built-in Python functions), at least try to explain what it's trying to achieve and maybe add an example.
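For reference, the same pattern spelled out with re.VERBOSE; the example strings are only guesses at what the firmware version lines look like, not actual strings output.

import re

VERSION_PATTERN = re.compile(
    r"""^(
        \d{8}\*                              # e.g. "20240820*..."
        |
        [A-Z][a-z]{2}\s+\d{1,2}\s+\d{4}\*    # e.g. "Aug 20 2024*..."
    ).*""",
    re.VERBOSE,
)

assert VERSION_PATTERN.match("20240820*hypothetical firmware line")
assert VERSION_PATTERN.match("Aug 20 2024*hypothetical firmware line")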
plugin: resource
id: intel_npu_gtest_resource_non_blocker
category_id: intel_npu
_summary: Produces a resource per test of Intel NPU Umd test
_description: Creates a resource for accessing tests for NPU drivers.
imports: from com.canonical.plainbox import manifest
requires: manifest.has_npu == 'True'
command: intel_npu_gtest_resource.py -m non-blocker
As I said before: if they are known failures, either you make sure you run them only in environments where they always pass, or you add the results as an informational job. Jobs that are included in the test plan are expected to pass if the machine has no issues.
> or you add the results as an informational job

What do you mean by this? Exclude it from the test plan and just keep it as an optional job that could be included by another test plan?
Yes, that was my approach. But after your previous comment, I think we should not have them in any test plan. If we decide to include any of these tests for a specific platform, we can add a "platform" option in intel_npu_gtest_resource.py to filter them.
plugin: resource
_description: Creates resource for firwmare version logging support in dmesg
estimated_duration: 2s
command: min_kernel_version.py 6.8
If you are not getting any information from the resource job, apart from whether it passes or not, you could get rid of the resource altogether and just have this check as a dependency of the rest of the NPU tests:
id: kmd/char_device_permissions
...
depends: info/kernel_config
id: check_min_kernel_version
category_id: intel_npu
flags: simple
_description: Check if the firmware version is supported
estimated_duration: 2s
command: min_kernel_version.py 6.8
> If you are not getting any information from the resource job, apart from whether it passes or not

Can you please clarify what you mean here? The resource job always passes but sets the resource to either supported or unsupported: not in the sense of whether the driver is supported, but whether getting the firmware version out of dmesg is supported. The job kmd/firmware_version requires that state to be supported because it inspects the dmesg logs to check the loaded firmware version and that it matches the one in the tested snap. If getting the firmware version from dmesg is not supported, kmd/firmware_version would fail, but the driver is still expected to work with kernels older than 6.8.
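Based on that explanation, the resource presumably does something like the sketch below; the resource key name is a guess, not the actual output of min_kernel_version.py.

#!/usr/bin/env python3
"""Sketch of a resource that reports whether reading the firmware
version from dmesg is supported (kernel at least the given version)."""
import platform
import sys


def main():
    required = tuple(int(part) for part in sys.argv[1].split("."))
    release = platform.release()  # e.g. "6.8.0-49-generic"
    current = tuple(int(part) for part in release.split("-")[0].split("."))
    # Resource jobs always succeed; they only emit key: value pairs.
    print("dmesg_fw_version: {}".format(
        "supported" if current >= required else "unsupported"
    ))


if __name__ == "__main__":
    main()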
Also, I have a couple of questions.

@fernando79513 Thank you for your review! The …

Sure, I'll upload those.
Description

This PR adds a provider for testing devices with an NPU. It is currently focused on Intel NPUs, which use the accel kernel interface.

The driver for these NPUs is distributed through the `intel-npu-driver` snap, which also includes a gtest-based testing utility, `npu-umd-test`. The tests in this provider check that the appropriate firmware version is loaded and that the user has the appropriate permissions; the rest run individual tests from the `npu-umd-test` utility.

The latest version of the `intel-npu-driver` snap now also includes the configuration file and model files necessary for running `npu-umd-test`, as well as a `known-failures` command which I've used to simplify the test filtering. The configuration file and the list of known failures will be maintained as part of the `intel-npu-driver` snap, as they can change from one version to another.

Known issues

Some of the test names coming from the `npu-umd-test` test suite are longer than 80 characters, which triggers a warning in checkbox.

Tests

Tests have been run on Meteor Lake and Arrow Lake devices.