
Conversation

@mbercx
Member

@mbercx mbercx commented Dec 2, 2025

Currently, the `BaseRestartWorkChain` has hardcoded behavior for unhandled failures: it restarts once, then aborts on the second consecutive failure with the `ERROR_SECOND_CONSECUTIVE_UNHANDLED_FAILURE` exit code. This approach lacks flexibility for use cases where users might want an immediate abort, or want to allow for human evaluation through pausing.

This commit introduces a new optional input `on_unhandled_failure` that allows users to configure how the work chain handles unhandled failures. The available options are:

  • `abort` (default): Abort immediately with `ERROR_UNHANDLED_FAILURE`
  • `pause`: Pause the work chain for user inspection
  • `restart_once`: Restart once, then abort if it fails again (similar to the old behavior)
  • `restart_and_pause`: Restart once, then pause if it still fails

BREAKING: The default behavior is set to `abort`, which is the most conservative option. In many cases this is the desired behavior, since restarting without changing the inputs will typically fail again, wasting resources. Users who want the old "restart once" behavior can explicitly set `on_unhandled_failure='restart_once'`.

BREAKING: The exit code `ERROR_SECOND_CONSECUTIVE_UNHANDLED_FAILURE` has been renamed to `ERROR_UNHANDLED_FAILURE` to better reflect the new flexible behavior, where a failure no longer necessarily means "second consecutive".

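The four options described above can be illustrated with a small, self-contained sketch of the dispatch logic. This is a hypothetical model, not the actual aiida-core implementation; the function name and return values are made up for illustration:

```python
def decide_action(on_unhandled_failure: str, already_restarted: bool) -> str:
    """Toy model of the action taken after an unhandled failure.

    :param on_unhandled_failure: one of 'abort', 'pause', 'restart_once',
        'restart_and_pause' (the values of the new input).
    :param already_restarted: whether a restart was already attempted for
        an unhandled failure.
    """
    if on_unhandled_failure == 'abort':
        return 'abort'  # fail immediately with ERROR_UNHANDLED_FAILURE
    if on_unhandled_failure == 'pause':
        return 'pause'  # wait for human inspection
    if on_unhandled_failure == 'restart_once':
        # old default behavior: one restart, then abort
        return 'abort' if already_restarted else 'restart'
    if on_unhandled_failure == 'restart_and_pause':
        # one restart, then pause for inspection
        return 'pause' if already_restarted else 'restart'
    raise ValueError(f'invalid option: {on_unhandled_failure!r}')
```

For example, `decide_action('restart_once', already_restarted=True)` yields an abort, mirroring the old "second consecutive unhandled failure" behavior.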
@codecov

codecov bot commented Dec 2, 2025

Codecov Report

❌ Patch coverage is 97.22222% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 79.60%. Comparing base (cd11f08) to head (1e75d97).
⚠️ Report is 4 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/aiida/engine/processes/workchains/restart.py | 97.23% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7116      +/-   ##
==========================================
+ Coverage   79.58%   79.60%   +0.03%     
==========================================
  Files         566      566              
  Lines       43517    43567      +50     
==========================================
+ Hits        34629    34679      +50     
  Misses       8888     8888              

☔ View full report in Codecov by Sentry.

In aiidateam#7069, we had logic to adapt the output of `verdi process list`
when the process was paused by the handler (and only in this case,
i.e., not when a user paused it).

This is, in our (C. Pignedoli's and my) opinion, very important:
otherwise the user most probably is not aware of why the calculation
was paused (or might not even realize it is paused).

To achieve this, a variable is used to mark whether the pause was
triggered by the handlers.
Moreover, `on_paused` is overridden to set the process status accordingly.

Importantly, we also changed the exact logic of `restart_and_pause` to
the one we had implemented in the other PR: namely, after pausing,
`self.ctx.unhandled_failure` is reset to `False`, so that if
another error occurs after replaying, two attempts are made before
pausing again.
We want to be as explicit as possible about what the user is supposed to do.
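The pause-then-reset cycle described above can be sketched as a toy state machine. This is an illustrative model with hypothetical names, not the aiida-core code; it only demonstrates why resetting the flag after pausing gives another restart attempt once the process is replayed:

```python
class RestartAndPauseModel:
    """Toy model of the `restart_and_pause` cycle (hypothetical, simplified)."""

    def __init__(self):
        # mirrors the role of `self.ctx.unhandled_failure` in the work chain
        self.unhandled_failure = False

    def on_unhandled_failure(self) -> str:
        """Return the action taken for an unhandled failure."""
        if not self.unhandled_failure:
            # first unhandled failure: mark it and restart once
            self.unhandled_failure = True
            return 'restart'
        # second consecutive failure: pause, and reset the flag so that a
        # failure after the user replays the process gets a fresh restart
        # attempt before pausing again
        self.unhandled_failure = False
        return 'pause'
```

Calling `on_unhandled_failure` repeatedly thus alternates restart, pause, restart, pause: after each pause and replay, the process gets one more automatic attempt.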
@giovannipizzi
Member

Thanks a lot @mbercx! We've tested this with @cpignedoli. We've just taken the liberty to push a couple of commits. Explanations are in the commit messages, but mainly:

  • customize the process status string shown in `verdi process list`, so it is clear to the user that the process is paused and that they should check `verdi process report`
  • adapt the logic: with `restart_and_pause`, after replaying, it tries again once before pausing
  • add a more verbose message in the process report to explain that one should either play or kill the process, with an explicit message on which command(s) to run

With @cpignedoli we tested all 4 scenarios with the QE app and aiida-qe 4.13, and all worked as expected. So, for us, this is ready to merge!

Let us know if you have final comments (e.g. on the string), otherwise we can merge this, and work on #7095 as a next step :-)

Member

@giovannipizzi giovannipizzi left a comment


With the most recent commits (that I did :-D with @cpignedoli), this is ready to be merged IMO!

@mbercx
Member Author

mbercx commented Dec 4, 2025

Thanks @giovannipizzi and @cpignedoli! I agree that we should update the process status, but I just wasn't happy with the overridden `on_paused` approach plus the context variable. I wanted to see if I could find a simpler solution. And indeed, the `pause` method of the plumpy `Process` class allows you to pass a message to set as the status:

https://github.com/aiidateam/plumpy/blob/2317b6f2d4aea8c1a998e3b3e4ae86c050cda6d9/src/plumpy/processes.py#L1098-L1130

So I adapted the code in 9a811a7 to just use that instead.

I'm now also looking into adapting the process state to "Waiting". One hacky approach would be to set the attribute manually, but again: I feel there should be a better solution. Probably the `pause` method should take care of this in plumpy, but let's see if I can find another clean approach.
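The idea of passing a message to `pause` so it is surfaced as the process status can be sketched with a minimal stand-in class. This is a simplified toy, not plumpy itself; the attribute names are assumptions for illustration:

```python
class ToyProcess:
    """Minimal stand-in illustrating a pause() that sets the process status."""

    def __init__(self):
        self.paused = False
        self.status = None

    def pause(self, msg=None):
        """Pause the process; an optional message becomes the status string."""
        self.paused = True
        if msg is not None:
            # the status would then be shown e.g. by `verdi process list`
            self.status = msg
```

With this, a handler can pause and explain itself in one call, e.g. `proc.pause('Paused by handler: check the report, then play or kill')`, removing the need to override an `on_paused` hook plus a context variable just to set the status.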

Finally, before merging we also still have to update the documentation.

@mbercx
Member Author

mbercx commented Dec 4, 2025

Regarding

> adapt the logic: if `restart_and_pause`, after replaying, it tries again once before pausing

I'm fine with this, but am wondering if this is what most users would want. 🤔 Once the process is paused after having already failed following a restart, I'm not sure a user would want another automatic restart before pausing again?

@cpignedoli

Tested the "pause" option again with aiida-qe 4.13; it works. Great work @mbercx!

@giovannipizzi
Member

> Once the process is paused after already failing after a restart, I'm not sure if a user would want another automatic restart before pausing again?

I don't know, I guess it's hard to find the perfect solution that makes everybody happy :-) If you prefer the other approach, we would be OK with that as well.

@mbercx
Member Author

mbercx commented Dec 4, 2025

> I don't know, I guess it's hard to find the perfect solution that makes everybody happy :-) If you prefer the other approach, we would be OK with that as well.

Yeah, I definitely understand your inclination for the "restart -> pause -> restart -> pause" behaviour, so I'm not sure what the "best" solution is. I'll just leave it as is. :)

@mbercx
Member Author

mbercx commented Dec 5, 2025

Alright, I still investigated the possibility of updating the process state to "Waiting", but it doesn't seem that simple. I'm leaving my findings below for future reference.

It's possible to set the process state on the node by running:

```python
self.node.set_process_state(ProcessState.WAITING)
```

Unfortunately, that state does not persist. After running `self.pause()`, the rest of the `inspect_process` step is still executed, until `return None`. I think at this point the state is reverted to "Running". At least, via

```python
raise ValueError(self.node.base.attributes.get('process_state'))
```

I can confirm that right after setting the process state it is indeed "Waiting", but when running `verdi process list` it still says "Running".

Fixing this will probably not be trivial, and will require changes in plumpy. So let's hold off on that for now.

@mbercx mbercx force-pushed the new/pause-unhandled branch from 9299525 to a78bae0 Compare December 5, 2025 04:29
@mbercx
Member Author

mbercx commented Dec 5, 2025

Ok, documentation has been updated! I decided to split up the corresponding how-to into two sections, running and writing, and have moved the "handler overrides" subsection to the first one.

I've also improved the validation error message:

```
ValueError: invalid attribute value `on_unhandled_failure`: 'pausse'. Must be one of: abort, pause, restart_once, restart_and_pause
```

and removed `EnumData` as a valid type. Although I initially thought it would be a good idea, I think providing a string is fine, especially with good help text, documentation and validation. Having to import the enum seems more inconvenient to me.

@GeigerJ2 since you mentioned you'd like to review, I've requested it. :) But if you're busy, don't worry. @giovannipizzi @cpignedoli you may still want to check the documentation; I have not touched the implementation besides removing the `EnumData` type in 1e75d97.

@mbercx mbercx marked this pull request as ready for review December 5, 2025 04:39
@mbercx mbercx requested a review from GeigerJ2 December 5, 2025 04:39
@giovannipizzi
Member

Thanks a lot Marnik! For me, the docs are great, so consider my approval still valid!

@giovannipizzi giovannipizzi self-requested a review December 5, 2025 14:35
@cpignedoli

Green light from my side!! Thanks a lot @mbercx

@mbercx
Member Author

mbercx commented Dec 7, 2025

@GeigerJ2 let me know if you'd still like to have a look at this, else I'll merge tomorrow. :)
