
Conversation

@mbercx
Member

@mbercx mbercx commented Dec 2, 2025

Currently, the `BaseRestartWorkChain` has hardcoded behavior for unhandled failures: it restarts once, then aborts on the second consecutive failure with the `ERROR_SECOND_CONSECUTIVE_UNHANDLED_FAILURE` exit code. This approach lacks flexibility for use cases where users might want an immediate abort, or want to allow for human evaluation through pausing.

This commit introduces a new optional input `on_unhandled_failure` that allows users to configure how the work chain handles unhandled failures. The available options are:

  • `abort` (default): Abort immediately with `ERROR_UNHANDLED_FAILURE`
  • `pause`: Pause the work chain for user inspection
  • `restart_once`: Restart once, then abort if it fails again (similar to the old behavior)
  • `restart_and_pause`: Restart once, then pause if it still fails

BREAKING: The default behavior is set to `abort`, which is the most conservative option. In many cases this is the desired behavior, since restarting without changing the inputs will typically fail again, wasting resources. Users who want the old "restart once" behavior can explicitly set `on_unhandled_failure='restart_once'`.

BREAKING: The exit code `ERROR_SECOND_CONSECUTIVE_UNHANDLED_FAILURE` has been renamed to `ERROR_UNHANDLED_FAILURE` to better reflect the new flexible behavior, where a failure no longer necessarily means "second consecutive".

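The four options described above can be illustrated with a small, self-contained sketch of the dispatch logic. This is a hypothetical model, not the actual aiida-core implementation; the function name and return values are made up for illustration:

```python
def decide_action(on_unhandled_failure: str, already_restarted: bool) -> str:
    """Toy model of the action taken after an unhandled failure.

    :param on_unhandled_failure: one of 'abort', 'pause', 'restart_once',
        'restart_and_pause' (the values of the new input).
    :param already_restarted: whether a restart was already attempted for
        an unhandled failure.
    """
    if on_unhandled_failure == 'abort':
        return 'abort'  # fail immediately with ERROR_UNHANDLED_FAILURE
    if on_unhandled_failure == 'pause':
        return 'pause'  # wait for human inspection
    if on_unhandled_failure == 'restart_once':
        # old default behavior: one restart, then abort
        return 'abort' if already_restarted else 'restart'
    if on_unhandled_failure == 'restart_and_pause':
        # one restart, then pause for inspection
        return 'pause' if already_restarted else 'restart'
    raise ValueError(f'invalid option: {on_unhandled_failure!r}')
```

For example, `decide_action('restart_once', already_restarted=True)` yields an abort, mirroring the old "second consecutive unhandled failure" behavior.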
@codecov

codecov bot commented Dec 2, 2025

Codecov Report

❌ Patch coverage is 97.22222% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 79.60%. Comparing base (cd11f08) to head (1e75d97).
⚠️ Report is 4 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/aiida/engine/processes/workchains/restart.py | 97.23% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7116      +/-   ##
==========================================
+ Coverage   79.58%   79.60%   +0.03%     
==========================================
  Files         566      566              
  Lines       43517    43567      +50     
==========================================
+ Hits        34629    34679      +50     
  Misses       8888     8888              

☔ View full report in Codecov by Sentry.

In aiidateam#7069, we had logic to adapt the output of `verdi process list`
when the process was paused by the handler (and only in this case,
i.e., not when a user paused it).

This is, in our (C. Pignedoli's and my) opinion, very important:
otherwise the user most probably is not aware of why the calculation
was paused (or might not even realize it is paused).

To achieve this, a variable is used to mark whether the pause was
triggered by the handlers.
Moreover, `on_paused` is overridden to set the process status accordingly.

Importantly, we also changed the exact logic of `restart_and_pause` to
the one we had implemented in the other PR: namely, after pausing,
`self.ctx.unhandled_failure` is reset to `False`, so that if
another error occurs after replaying, two attempts are made before
pausing again.
We want to be as explicit as possible about what the user is supposed to do.
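The pause-then-reset cycle described above can be sketched as a toy state machine. This is an illustrative model with hypothetical names, not the aiida-core code; it only demonstrates why resetting the flag after pausing gives another restart attempt once the process is replayed:

```python
class RestartAndPauseModel:
    """Toy model of the `restart_and_pause` cycle (hypothetical, simplified)."""

    def __init__(self):
        # mirrors the role of `self.ctx.unhandled_failure` in the work chain
        self.unhandled_failure = False

    def on_unhandled_failure(self) -> str:
        """Return the action taken for an unhandled failure."""
        if not self.unhandled_failure:
            # first unhandled failure: mark it and restart once
            self.unhandled_failure = True
            return 'restart'
        # second consecutive failure: pause, and reset the flag so that a
        # failure after the user replays the process gets a fresh restart
        # attempt before pausing again
        self.unhandled_failure = False
        return 'pause'
```

Calling `on_unhandled_failure` repeatedly thus alternates restart, pause, restart, pause: after each pause and replay, the process gets one more automatic attempt.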
@giovannipizzi
Member

Thanks a lot @mbercx! We've tested this with @cpignedoli. We've just taken the liberty to push a couple of commits. Explanations are in the commit messages, but mainly:

  • customize the process status string shown in `verdi process list`, so it is clear to the user that the process is paused and that they should check `verdi process report`
  • adapt the logic: with `restart_and_pause`, after replaying, it tries again once before pausing
  • add a more verbose message in the process report to explain that one should either play or kill the process, with an explicit message on which command(s) to run

With @cpignedoli we tested all 4 scenarios with the QE app and aiida-qe 4.13, and all worked as expected. So, for us, this is ready to merge!

Let us know if you have final comments (e.g. on the string), otherwise we can merge this, and work on #7095 as a next step :-)

Member

@giovannipizzi giovannipizzi left a comment


With the most recent commits (that I did :-D with @cpignedoli), this is ready to be merged IMO!

@mbercx
Member Author

mbercx commented Dec 4, 2025

Thanks @giovannipizzi and @cpignedoli! I agree that we should update the process status, but I just wasn't happy with the overridden `on_paused` approach plus the context variable. I wanted to see if I could find a simpler solution. And indeed, the `pause` method of the plumpy `Process` class allows you to pass a message to set as the status:

https://github.com/aiidateam/plumpy/blob/2317b6f2d4aea8c1a998e3b3e4ae86c050cda6d9/src/plumpy/processes.py#L1098-L1130

So I adapted the code in 9a811a7 to just use that instead.

I'm now also looking into adapting the process state to "Waiting". One hacky approach would be to set the attribute manually, but again: I feel there should be a better solution. Probably the `pause` method should take care of this in plumpy, but let's see if I can find another clean approach.
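The idea of passing a message to `pause` so it is surfaced as the process status can be sketched with a minimal stand-in class. This is a simplified toy, not plumpy itself; the attribute names are assumptions for illustration:

```python
class ToyProcess:
    """Minimal stand-in illustrating a pause() that sets the process status."""

    def __init__(self):
        self.paused = False
        self.status = None

    def pause(self, msg=None):
        """Pause the process; an optional message becomes the status string."""
        self.paused = True
        if msg is not None:
            # the status would then be shown e.g. by `verdi process list`
            self.status = msg
```

With this, a handler can pause and explain itself in one call, e.g. `proc.pause('Paused by handler: check the report, then play or kill')`, removing the need to override an `on_paused` hook plus a context variable just to set the status.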

Finally, before merging we also still have to update the documentation.

@mbercx
Member Author

mbercx commented Dec 4, 2025

Regarding

> adapt the logic: if `restart_and_pause`, after replaying, it tries again once before pausing

I'm fine with this, but am wondering if this is what most users would want. 🤔 Once the process is paused after having already failed following a restart, I'm not sure a user would want another automatic restart before pausing again?

@cpignedoli

Tested the "pause" option again with aiida-qe 4.13; it works. Great work @mbercx!

@giovannipizzi
Member

> Once the process is paused after already failing after a restart, I'm not sure if a user would want another automatic restart before pausing again?

I don't know, I guess it's hard to find the perfect solution that makes everybody happy :-) If you prefer the other approach, we would be OK with that as well.

@mbercx
Member Author

mbercx commented Dec 4, 2025

> I don't know, I guess it's hard to find the perfect solution that makes everybody happy :-) If you prefer the other approach, we would be OK with that as well.

Yeah, I definitely understand your inclination for the "restart -> pause -> restart -> pause" behaviour, so I'm not sure what the "best" solution is. I'll just leave it as is. :)

@mbercx
Member Author

mbercx commented Dec 5, 2025

Alright, I still investigated the possibility of updating the process state to "Waiting", but it doesn't seem that simple. I'm leaving my findings below for future reference.

It's possible to set the process state on the node by running:

```python
self.node.set_process_state(ProcessState.WAITING)
```

Unfortunately, that state does not persist. After running `self.pause()`, the rest of the `inspect_process` step is still executed, until `return None`. I think at this point the state is reverted to "Running". At least, via

```python
raise ValueError(self.node.base.attributes.get('process_state'))
```

I can confirm that right after setting the process state it is indeed "Waiting", but when running `verdi process list` it still says "Running".

Fixing this will probably not be trivial, and will require changes in plumpy. So let's hold off on that for now.

@mbercx mbercx force-pushed the new/pause-unhandled branch from 9299525 to a78bae0 Compare December 5, 2025 04:29
@mbercx
Member Author

mbercx commented Dec 5, 2025

Ok, documentation has been updated! I decided to split up the corresponding how-to into two sections, running and writing, and have moved the "handler overrides" subsection to the first one.

I've also improved the validation error message:

```
ValueError: invalid attribute value `on_unhandled_failure`: 'pausse'. Must be one of: abort, pause, restart_once, restart_and_pause
```

and removed `EnumData` as a valid type. Although I initially thought it would be a good idea, I think providing a string is fine, especially with good help text, documentation and validation. Having to import the enum seems more inconvenient to me.

@GeigerJ2 since you mentioned you'd like to review, I've requested it. :) But if you're busy, don't worry. @giovannipizzi @cpignedoli you may still want to check the documentation; I have not touched the implementation besides removing the `EnumData` type in 1e75d97.

@mbercx mbercx marked this pull request as ready for review December 5, 2025 04:39
@mbercx mbercx requested a review from GeigerJ2 December 5, 2025 04:39
@giovannipizzi
Member

Thanks a lot Marnik! For me, the docs are great, so consider my approval still valid!

@giovannipizzi giovannipizzi self-requested a review December 5, 2025 14:35
@cpignedoli

Green light from my side!! Thanks a lot @mbercx

@mbercx
Member Author

mbercx commented Dec 7, 2025

@GeigerJ2 let me know if you'd still like to have a look at this, else I'll merge tomorrow. :)
