-
Notifications
You must be signed in to change notification settings - Fork 235
💥 BaseRestartWorkChain: add on_unhandled_failure input
#7116
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Currently, the `BaseRestartWorkChain` has hardcoded behavior for unhandled failures: it restarts once, then aborts on the second consecutive failure with the `ERROR_SECOND_CONSECUTIVE_UNHANDLED_FAILURE` exit code. This approach lacks flexibility for different use cases where users might want immediate abort or allow for human evaluation through pausing. This commit introduces a new optional input `on_unhandled_failure` that allows users to configure how the work chain handles unhandled failures. The available options are: - `abort` (default): Abort immediately with ERROR_UNHANDLED_FAILURE - `pause`: Pause the work chain for user inspection - `restart_once`: Restart once, then abort if it fails again (similar to old behavior) - `restart_and_pause`: Restart once, then pause if it still fails BREAKING: The default behavior is set to `abort`, which is the most conservative option. In many cases this is the desired behavior, since doing a restart without changing the inputs will typically fail again, wasting resources. Users who want the old "restart once" behavior can explicitly set `on_unhandled_failure='restart_once'`. BREAKING: The exit code `ERROR_SECOND_CONSECUTIVE_UNHANDLED_FAILURE` has been renamed to `ERROR_UNHANDLED_FAILURE` to better reflect the new flexible behavior where failure doesn't necessarily mean "second consecutive" anymore.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #7116 +/- ##
==========================================
+ Coverage 79.58% 79.60% +0.03%
==========================================
Files 566 566
Lines 43517 43567 +50
==========================================
+ Hits 34629 34679 +50
Misses 8888 8888 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
In aiidateam#7069, we had logic to adapt the output of `verdi process list` when the process was paused by the handler (and only in this case, i.e., not when a user was pausing it). This, in our (C. Pignedoli's and mine) opinion is very important, otherwise the user most probably is not aware of why the calculation was paused (or might not even realize it's paused). To achieve this, a variable is used to mark if the pause was triggered by the handlers. Moreover, `on_paused` is overridden to set the process status accordingly. Importantly, we also changed the exact logic of `restart_and_pause` to the one we had implemented in the other PR: namely, after pausing, the `self.ctx.unhandled_failure` is reset to `False`, so that if another error occurs after replaying, two attempts are tried before pausing again.
d2e2153 to
6a7c280
Compare
We want to be as explicit as possible on what the user is supposed to do.
6a7c280 to
be2c61c
Compare
|
Thanks a lot @mbercx! We've tested this with @cpignedoli. We've just taken the liberty to push a couple of commits. Explanation in the commit messages, but mainly:
With @cpignedoli we tested all 4 scenarios with the QE app and aiida-qe 4.13, and all worked as expected. So, for us, this is ready to merge! Let us know if you have final comments (e.g. on the string), otherwise we can merge this, and work on #7095 as a next step :-) |
giovannipizzi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the most recent commits (that I did :-D with @cpignedoli), this is ready to be merged IMO!
|
Thanks @giovannipizzi and @cpignedoli! I agree that we should update the process status, but just wasn't happy with the override So I adapted the code in 9a811a7 to just use that instead. I'm now also looking into adapting the process state to Finally, before merging we also still have to update the documentation. |
|
Regarding
I'm fine with this, but am wondering if this is what most users would want. 🤔 Once the process is paused after already failing after a restart, I'm not sure if a user would want another automatic restart before pausing again? |
|
Tested again the "pause" option with aiida-qe 4.13 works, great work @mbercx |
I don't know, I guess it's hard to find the perfect solution that makes everybody happy :-) if you prefer the other approach, we would be OK as well |
Yeah, I definitely understand your inclination for the "restart -> pause -> restart -> pause" behaviour, so not sure what the "best" solution is. I'll just leave it as is. :) |
|
Alright, I still investigated the possibility of updating the process state to "Waiting", but it doesn't seem that simple. I'm leaving my findings below for future reference: It's possible to set the process state on the node by running: Unfortunately, that state does not persist. After running That right after setting the process state it is indeed waiting, but when running Fixing this will probably not be trivial, and require changes in |
9299525 to
a78bae0
Compare
|
Ok, documentation has been updated! I decided to split up the corresponding how-to into two sections, running and writing, and have moved the "handler overrides" subsection to the first one. I've also improved the validation error message: and removed @GeigerJ2 since you mentioned you'd like to review, I've requested it. :) But if you're busy, don't worry. @giovannipizzi @cpignedoli you may still want to check the documentation, I have not touched the implementation besides removing the |
|
Thanks a lot Marnik! For me docs are great, so consider my approval still valid! |
|
Green light from my side!! Thanks a lot @mbercx |
|
@GeigerJ2 let me know if you'd still like to have a look at this, else I'll merge tomorrow. :) |

Currently, the
BaseRestartWorkChainhas hardcoded behavior for unhandled failures: it restarts once, then aborts on the second consecutive failure with theERROR_SECOND_CONSECUTIVE_UNHANDLED_FAILUREexit code. This approach lacks flexibility for different use cases where users might want immediate abort or allow for human evaluation through pausing.This commit introduces a new optional input
on_unhandled_failurethat allows users to configure how the work chain handles unhandled failures. The available options are:abort(default): Abort immediately with ERROR_UNHANDLED_FAILUREpause: Pause the work chain for user inspectionrestart_once: Restart once, then abort if it fails again (similar to old behavior)restart_and_pause: Restart once, then pause if it still failsBREAKING: The default behavior is set to
abort, which is the most conservative option. In many cases this is the desired behavior, since doing a restart without changing the inputs will typically fail again, wasting resources. Users who want the old "restart once" behavior can explicitly seton_unhandled_failure='restart_once'.BREAKING: The exit code
ERROR_SECOND_CONSECUTIVE_UNHANDLED_FAILUREhas been renamed toERROR_UNHANDLED_FAILUREto better reflect the new flexible behavior where failure doesn't necessarily mean "second consecutive" anymore.