tools/ci: Granular build selection tool #18500
linguini1 wants to merge 2 commits into apache:master
Conversation
This script outputs the minimum list of defconfigs to build in order to completely test a change set. Right now it is not integrated anywhere, but it will eventually be integrated into CI in order to cut down on unnecessary build runs.
Signed-off-by: Matteo Golin <matteo.golin@gmail.com>
Included documentation for a new tool, select.py, and also added a section for CI tools to be filled in.
Signed-off-by: Matteo Golin <matteo.golin@gmail.com>
cederom
left a comment
- Very cool thank you @linguini1 :-)
- We could test only changed platforms and save on CI use that way!
- We still need to verify the whole ecosystem daily to check that no regressions were introduced; and this can be done by a separate CI action, right? :-)
lupyuen
left a comment
This tool is generally OK right now. But later on: Touching the CI Workflow will get really scary, I hope we can find more ways to de-risk the Upcoming CI Change...
(1) How do we prove that this works for all possible PRs? Will we run it on past PRs to compare the results?
(2) Can we deploy this in stages: e.g. Arm64 first? Or target a specific subset of Complex PRs that happen often, but are really supposed to be Simple PRs?
(3) Maybe this tool is more helpful for NuttX Apps? Because ANY change to NuttX Apps will trigger a 3.5-hour build. Which really pains me.
(4) How will we monitor this Upcoming CI Change? Make sure that all PRs are actually built and tested correctly?
(5) This tool falls into the "Code Quality Trap" that I explained in my video. We have to be 100% sure that we don't miss out any builds and tests, that will lead to NuttX Breakage later.
(6) I also wonder: Are we submitting the tool too prematurely? Should we follow the agile / iterative way: Make it work on a Simulated NuttX Repo, complete with CI Changes? Then refactor the tool based on the actual tested code? Because many things can go wrong as we're doing auto-dependency-analysis of PRs. Thanks :-)
@lupyuen what do you propose to verify the change? To test it on the @linguini1 repo in the first place? When all works as expected over there, then merge it here? Will a free account bear this kind of load? :-) @linguini1 would it be possible to easily / quickly revert if anything goes wrong? :-P
This PR is missing the most important thing: how to integrate it with CI and whether it is possible at all. I think it's better to wait for a PoC of a working selector integrated with CI, otherwise this tool doesn't make much sense to merge
CI build failed, restarted.
I guess this change is safe, as it is only a preparation step and does not touch the current CI; this way @linguini1 can test it and develop it further on his local fork, right? :-)
Ehhhh I'm planning to discuss this in my next video... (1) Do we have actual stats on how often this really happens? How many GitHub Runners will we actually save? (2) When we kill builds too early: Will it cause our devs to resubmit PRs again and again? Consuming even more GitHub Runners? Do we have any predictive stats? (3) I have been testing extremely impactful PRs in my own repos, like https://github.com/lupyuen10/nuttx. So far I have no problems, just create a new (free) org, rename a few things, and it builds OK! Why won't this work for All PRs? Am I missing something, that's preventing our devs from doing this on their own PRs? Is it too complicated? Or because nobody documented this "secret hack"?
Yes we should test this on a Free GitHub Org, similar to https://github.com/lupyuen10/nuttx |
Thanks for the feedback everyone! I guess the goal of this PR was:
But, I think the concerns are right that it is probably better to test this tool in the integrated CI before merging it; maybe that will catch corner cases earlier. I will use Lup's handy trick to test this on my own fork beforehand :) @lupyuen to answer your questions:
@cederom if we follow Lup's suggestion of phasing in this tool one arch at a time, then it should be easy to roll back if something goes terribly wrong. Hopefully by that point it will have been significantly tested.
@linguini1 Awesome thanks! Another thing that really bugs me: Downloading the ESP32 Runtime fails too often. Which also wastes our GitHub Runners. I explained this in my other video: https://www.youtube.com/watch?v=lwkMS_bgyXA Since you're testing a new CI, it might be good to fix the ESP32 Runtime Downloading too. @simbit18 might have some brilliant ideas. Thanks :-)
The main problem I see with this tool is that CI runs builds in parallel, and this is configured via YAML and handled by GitHub Actions. All of this happens automatically, and I don't know if it's possible to fine-tune the built targets without completely reworking the CI. EDIT: we could probably inject your script here: nuttx/.github/workflows/build.yml, Lines 222 to 228 in f4a6e62, or integrate it with
Or we can hack arch.yml. Though the Build Selection will not be 100% specific. For example: "Build only arm-01, but skip arm-02 to arm-14, because they are not impacted". And this might be less risky, I think? 🤔
maybe we should integrate this tool directly in
another idea that is probably the least risky is to split
Okay, this may be the time we have a dedicated MIRROR repo for all dependencies! That would be perfect for NXDART, because we would only access the network once, then just use offline packages from our other repo. This should also remove download failures :-)
Well then our ESP32 Colleagues might complain that our Mirror Repo is stale and doesn't have the latest versions of all dependencies. Or do we use the Mirror Repo as an Always-Available Backup for fetching dependencies? |
Hmm, maybe we should only work on release packages then? :-P Or use a fetch command that first checks our mirror and, when that fails, tries the public connection? Something like that would constitute our self-contained ecosystem :-)
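The mirror-first-then-public fetch could be sketched like this (the source URLs and the `download` callable are placeholders for illustration, not real project infrastructure):

```python
def fetch(relpath, sources, download):
    """Try each source base URL in order.

    `download(url)` is whatever fetcher the tooling uses; it should
    return the payload bytes or raise OSError on failure.
    """
    last_err = None
    for base in sources:
        url = f"{base}/{relpath}"
        try:
            return download(url)
        except OSError as err:
            last_err = err  # mirror stale or unreachable: try the next source
    raise RuntimeError(f"all sources failed for {relpath}") from last_err
```

With the mirror listed first and the public host second, the public connection is only touched when the mirror fails, which matches the "check our mirror first" idea above.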
I have a hunch about the ESP32 Download Failure: We are running too many jobs in parallel, which will fire Many Concurrent HTTP Requests to download the ESP32 Runtime (via github.com). And our Concurrent Downloads will get blocked by github.com for suspected spam. Maybe we should chat with our ESP32 Colleagues: Do they host the ESP32 Runtime outside of GitHub? Preferably on Multiple Hostnames, so we don't spam One Single Server? (Or can they offer to host ESP32 Runtime elsewhere?) Fixing the ESP32 Downloads is a terrific start. I see it failing all the time on NuttX Dashboard (sigh)
Good point @lupyuen!! From my experience, a single-file download (release package) is treated a lot more kindly by any hosting provider, at the server / network level, than many small files in parallel. Some time ago on our website the GitHub upload-artifacts step was updated to transfer a package rather than all files separately; this also improves transfer time, surprisingly. I did some web application that worked fine on the web. Then I added a mobile application with a REST API that did a sequence of requests, and that also worked fine. Then I tried to improve synchronization speed by moving the requests to parallel, and there are almost always failures, so retries with a delay are necessary. When you hit a connection-limit quota, a delay is almost always necessary. A quick fix would be to add a DELAY + RETRY option to all fetchers. But the perfect situation would be a single package fetch + delays + retries. If these are already here, maybe only some tuning is enough :-)
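A DELAY + RETRY wrapper of the kind suggested here could look like the sketch below (the parameter values are illustrative, not tuned, and `download` stands in for whatever fetcher the tools actually use):

```python
import time

def fetch_with_retry(download, url, retries=3, delay=2.0, backoff=2.0):
    """Call download(url), retrying with an increasing delay on failure.

    `download` should return the payload or raise OSError; the delay
    grows by `backoff` each attempt so parallel jobs don't hammer the
    server in lockstep after a shared failure.
    """
    attempt = 0
    while True:
        try:
            return download(url)
        except OSError:
            attempt += 1
            if attempt > retries:
                raise  # exhausted: surface the last error to the caller
            time.sleep(delay)
            delay *= backoff
```

If the existing fetchers already retry, then as noted above only the delay/backoff tuning might be needed.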
cederom
left a comment
CI build fails again, need to investigate :-)
Hi everyone! A possible solution (to be verified, of course) to test without modifying the existing process:

./cibuild.sh -c -A -N -R -S workspace/citest/${{matrix.boards}}.dat

A workspace/citest folder outside the nuttx tree could also be useful for locally testing boards under development or other things, by manually adding the file with the boards to build.

Questions: Will this system on GitHub only work for Linux (Docker)? How can we know if it is possible to build a board on macOS, MSVC, or MSYS2? Will these jobs be started in any case? What should we do if there are changes to a board and other types, such as nuttx/drivers? What should the tool generate?

Of course, the questions don't end there! But I have to preserve my brain cells :)
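Generating those workspace/citest/*.dat files could be sketched like this. The one-entry-per-line board:config format is assumed from the nucleo examples later in this thread, and the file naming and job size are illustrative, not the real testlist conventions:

```python
from pathlib import Path

def write_dat_files(boards, out_dir, job_size=20):
    """Split the selected board:config entries into per-job .dat files.

    Each file could then be fed to cibuild.sh via the -S flag, one file
    per matrix job. `job_size` controls how many builds land in a job.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(0, len(boards), job_size):
        path = out / f"selected-{i // job_size + 1:02d}.dat"
        path.write_text("\n".join(boards[i:i + job_size]) + "\n")
        paths.append(path)
    return paths
```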
We also have unused https://github.com/apache/nuttx-testing where we can stage some tests before they land in here :-) |
@linguini1 I think this is a great idea!
I think the next step is to map all drivers and fs and build only board configs that are using it, i.e.:
If a PR modified the file "fs/fat/fs_fat32.c" then the script could inspect the "fs/fat/CMakeLists.txt" and figure out that this file is compiled when CONFIG_FS_FAT is enabled, then it could track which board configs are using it.
I'm not sure if it is easy to do, but comparing Make.defs vs CMakeLists.txt, the CMakeLists.txt files seem easier to track. I think if you search backwards for the second previous "(" symbol before the file's position (fs_fat32.c), it will find the right symbol. Just an idea!
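The heuristic described above could be prototyped roughly as below. This is only a sketch: it assumes the `if(CONFIG_...)` guard style and ignores nested conditions, `else()` branches, and sources added unconditionally:

```python
import re

def config_for_source(cmake_text, source_name):
    """Find which CONFIG_* option appears to guard `source_name`.

    Locates the file in the CMakeLists.txt text, then returns the
    nearest preceding if(CONFIG_...) guard, if any.
    """
    pos = cmake_text.find(source_name)
    if pos < 0:
        return None  # file not referenced in this CMakeLists.txt
    guards = [(m.start(), m.group(1))
              for m in re.finditer(r"if\s*\(\s*(CONFIG_\w+)", cmake_text)]
    preceding = [g for g in guards if g[0] < pos]
    return preceding[-1][1] if preceding else None
```

From there the tool would still need to track which board configs enable the returned option, as the comment above suggests.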
Lup, I was thinking about these downloads all the time; maybe besides the mirror idea, we could have these files already available in the CI Docker image. I don't know whether it is possible or not, but maybe it is. Although in theory it is good to download the files to confirm the link is still available, we know it is a major cause of CI failures, and it also impacts Runner usage because a test that could have finished early needs to be executed again and again.
Yes, that would be an important requirement!
My (perhaps naive) idea was to use this tool to either generate the
I figured this would be alright for CI, since Python is used in a few other places for CI already. It also means the script is cross-platform, so no
That works, but then we have to maintain quite a few more labels and
Yes! This was the idea for integrating this tool. The questions:
I don't think so. The
This might be beyond my understanding of CI, as I had assumed any board can be built on any host (or at least, that's the goal of NuttX)? But maybe we can start with just adopting this on the Linux builds first since that's where most of the work takes place.
Right now it considers this a complex PR, the same as the current CI. So that triggers a full build of all boards.
No kidding! I'm learning that our CI is very complex! And also very important, so it would be bad to break it.
This is a good idea, and the logical next step for sure. This will be significantly more complex though, as instead of choosing builds based on the changed file's path (as is currently possible for

One (very fresh and likely very naive) idea I had was to construct a dependency graph for source files. I.e., each source file in the tree is associated with a list of

Of course, this requires that we have the dependency graph in the first place. I don't know how cheaply it can be generated by our build system (i.e., my thought is that CMake will know which build files are getting used for a
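The lookup side of the dependency-graph idea is simple once the graph exists. In the sketch below the graph (source file → the defconfigs whose builds compile it) is assumed to be produced offline by the build system; treating a file that is missing from the graph as impacting everything is a conservative assumption of this sketch, not a decided behavior:

```python
def impacted_defconfigs(changed_files, dep_graph):
    """Return the defconfigs impacted by a change set.

    dep_graph maps each source file to the set of defconfigs whose
    builds compile it. Unknown files force a full rebuild, which
    errs on the side of not missing builds.
    """
    everything = set().union(*dep_graph.values()) if dep_graph else set()
    impacted = set()
    for f in changed_files:
        if f not in dep_graph:
            return everything  # unknown dependency: rebuild all
        impacted |= dep_graph[f]
    return impacted
```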
Hi @linguini1! Yes, for example, you can choose the toolchain for ARM. This corresponds to nucleo-l152re:nsh
This corresponds to nucleo-f411re:nsh
If we add this to the .dat file
We should worry about Strange Dependencies like this: Why would a fix for ESP32 RISC-V, affect ESP32 Xtensa? 🤔 |
In this case, isn't it a change for both? boards/Kconfig is modified, which should cause a full build in our CI, I think (?)
Sorry I meant this fix for ESP32 RISC-V: Broke the build for ESP32 Xtensa, as explained here: |
Summary
This PR includes select.py, which is some initial groundwork for more granular selection of builds. It takes in a list of changed files (say, the output of git diff --name-only) and outputs a list of defconfig files to be built to test the change set. The benefit is that it always(1) outputs the minimum number of configurations required to fully test the change set. You can find more information in the docs included with this PR, which show some example invocations.

(1): it should always. There may be edge cases I haven't encountered yet.
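As a rough illustration of the kind of selection described above (this is not the actual select.py logic), a naive path-based selector might look like the sketch below. The rule that a change under boards/ selects only that board's configs, while anything else forces a full build, is an assumption made for the sketch:

```python
from pathlib import PurePosixPath

def select_defconfigs(changed_files, all_defconfigs):
    """Return the set of defconfig paths to build for a change set.

    changed_files: paths as printed by `git diff --name-only`.
    all_defconfigs: every defconfig path in the tree, e.g.
        boards/arm/stm32h7/nucleo-h753zi/configs/nsh/defconfig
    """
    selected = set()
    for f in changed_files:
        parts = PurePosixPath(f).parts
        if parts[0] == "boards" and len(parts) >= 4:
            # Board-local change: build only that board's configs.
            board_prefix = "/".join(parts[:4])
            selected |= {d for d in all_defconfigs if d.startswith(board_prefix)}
        else:
            # Anything outside a specific board directory is treated
            # as a complex change: fall back to building everything.
            return set(all_defconfigs)
    return selected
```

Note that a change to boards/Kconfig has fewer than four path components, so this sketch also falls back to a full build for it.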
Impact
None right now, besides adding a doc file.
This tool is currently not used anywhere in CI, but is a precursor to some modifications I hope to make that reduce the number of builds performed for each PR. I will make those changes once I learn a little bit more about the build selection process and ensure that this solution can be integrated in a way that still allows parallel builds and doesn't require heavy modification of any of the existing tooling.
Testing
Docs: ran make autobuild locally and viewed the render in browser.

I have tested this locally using made-up change sets to see if the output matches the minimum coverage that it should. In all of my testing so far, the results look correct! See some log examples in the included docs with the sample invocations :)
Relevant example which illustrates benefit
Here is one example using the change-set from #18397. This patch only modifies a single board, adding new board support for the ST Nucleo-H753ZI. The current CI infrastructure marks the PR as Arch: Arm, which is currently the largest architecture on NuttX (in terms of defconfig files to build). The CI for this PR ran for a little over 14h: https://github.com/apache/nuttx/actions/runs/22040681175/usage

If we look at what select.py recommends to build to test this patch, we can see that it's much more granular. The shortest arm build in the original PR ran for 33 minutes (arm-12) and builds 43 configurations. If select.py was in charge, 3 configurations would be built. That's roughly 3 minutes of CI usage instead of 14 hours (if we assume that each build takes approximately the same amount of time)!
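The "roughly 3 minutes" figure follows from the stated numbers, under the same even-build-time assumption:

```python
# arm-12 took 33 minutes to build 43 configurations.
minutes_per_config = 33 / 43
# select.py would build only 3 configurations for this change set.
estimate_minutes = 3 * minutes_per_config
print(round(minutes_per_config * 60), "s per config,",
      round(estimate_minutes, 1), "minutes total")
```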