@@ -57,22 +57,102 @@ The work chain runs the subprocess.
57
57
Once it has finished, it then inspects the status.
58
58
If the subprocess finished successfully, the work chain returns the results and its job is done.
59
59
If, instead, the subprocess failed, the work chain should inspect the cause of failure, and attempt to fix the problem and restart the subprocess.
60
-
This cycle is repeated until the subprocess finishes successfully.
61
-
Of course this runs the risk of entering into an infinite loop if the work chain never manages to fix the problem, so we want to build in a limit to the maximum number of calculations that can be re-run:
60
+
This cycle is repeated until either the subprocess finishes successfully or a maximum number of iterations is reached.
An improved flow diagram for the base work chain, which limits the number of attempts the work chain makes to get the calculation to finish successfully.
69
-
70
-
Since this is such a common logical flow for a base work chain that is to wrap another :py:class:`~aiida.engine.processes.process.Process` and restart it until it is finished successfully, we have implemented it as an abstract base class in ``aiida-core``.
62
+
Since this is such a common logical flow for a base work chain that wraps another :py:class:`~aiida.engine.processes.process.Process` and restarts it until it is finished successfully, we have implemented it as an abstract base class in ``aiida-core``.
71
63
The :py:class:`~aiida.engine.processes.workchains.restart.BaseRestartWorkChain` implements the logic of the flow diagram shown above.
72
64
Although the ``BaseRestartWorkChain`` is a subclass of :py:class:`~aiida.engine.processes.workchains.workchain.WorkChain` itself, you cannot launch it.
73
65
The reason is that it is completely general and so does not know which :py:class:`~aiida.engine.processes.process.Process` class it should run.
74
66
Instead, to make use of the base restart work chain, you should subclass it for the process class that you want to wrap.
75
67
68
+
Running a ``BaseRestartWorkChain``
69
+
==================================
70
+
71
+
Many plugin packages have already implemented a ``BaseRestartWorkChain`` for the codes they support, such as the ``PwBaseWorkChain`` in ``aiida-quantumespresso`` or the ``Cp2kBaseWorkChain`` in ``aiida-cp2k``.
72
+
The inputs will depend on the calculation and choices of the developer, but there are several default inputs you can configure to control the behavior of any ``BaseRestartWorkChain``.
73
+
74
+
Specifying the maximum number of iterations
75
+
--------------------------------------------
76
+
77
+
To prevent a work chain from entering an infinite loop if it never manages to fix the problem, the ``BaseRestartWorkChain`` limits the maximum number of times the subprocess can be restarted.
78
+
This is controlled by the ``max_iterations`` input, which defaults to ``5``:
79
+
80
+
.. code-block:: python
81
+
82
+
from aiida.engine import submit
from aiida.orm import Int
83
+
84
+
inputs = {
85
+
'process_inputs': {
86
+
'input_1': value_1,
87
+
'input_2': value_2
88
+
},
89
+
'max_iterations': Int(10)
90
+
}
91
+
submit(SomeBaseWorkChain, **inputs)
92
+
93
+
If the subprocess fails and is restarted repeatedly until ``max_iterations`` is reached without succeeding, the work chain will abort with exit code ``401`` (``ERROR_MAXIMUM_ITERATIONS_EXCEEDED``).
94
+
95
+
96
+
Handler overrides
97
+
-----------------
98
+
99
+
It is possible to change the priority of handlers and enable/disable them without changing the source code of the work chain.
100
+
These properties of the handlers can be controlled through the ``handler_overrides`` input of the work chain.
101
+
This input takes a ``Dict`` node that has the following form:
102
+
103
+
.. code-block:: python
104
+
105
+
handler_overrides = Dict({
106
+
'handler_negative_sum': { # Insert the name of the process handler here
107
+
'enabled': True,
108
+
'priority': 10000
109
+
}
110
+
})
111
+
112
+
As you can see, each key is the name of the handler to affect and each value is a dictionary that can take two keys: ``enabled`` and ``priority``.
113
+
To enable or disable a handler, set ``enabled`` to ``True`` or ``False``, respectively.
114
+
The ``priority`` key takes an integer and determines the priority of the handler.
115
+
Note that both keys are optional; any value that is given overrides the one configured by the process handler decorator in the source code of the work chain.
116
+
The changes only affect the work chain instance that receives the ``handler_overrides`` input; all other instances of the work chain are unaffected.
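The effect of ``handler_overrides`` can be sketched in plain Python. This is only an illustration and not the ``aiida-core`` implementation: ``apply_handler_overrides`` and ``handler_hypothetical`` are made-up names, the default values are invented for the example, and the sketch assumes that handlers with a higher ``priority`` are called first.

```python
def apply_handler_overrides(defaults, overrides):
    """Merge per-handler overrides into the defaults and return the enabled handlers in call order.

    Both arguments map a handler name to a dictionary with the optional keys
    ``enabled`` and ``priority``. Keys that are missing from an override keep
    the default value.
    """
    merged = {}
    for name, settings in defaults.items():
        # Values from the override win over the defaults, key by key.
        merged[name] = {**settings, **overrides.get(name, {})}

    enabled = [name for name, settings in merged.items() if settings['enabled']]

    # Assumption: a higher priority means the handler is called earlier.
    return sorted(enabled, key=lambda name: merged[name]['priority'], reverse=True)


# Hypothetical defaults, standing in for what the decorators in the source code would set.
defaults = {
    'handler_negative_sum': {'enabled': True, 'priority': 0},
    'handler_hypothetical': {'enabled': True, 'priority': 500},
}

print(apply_handler_overrides(defaults, {}))
print(apply_handler_overrides(defaults, {'handler_negative_sum': {'priority': 10000}}))
print(apply_handler_overrides(defaults, {'handler_hypothetical': {'enabled': False}}))
```

Because the merge happens per key, an override that only sets ``priority`` leaves the ``enabled`` state of that handler untouched, and vice versa.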
117
+
118
+
119
+
Configuring unhandled failure behavior
120
+
--------------------------------------
121
+
122
+
.. versionadded:: 2.8
123
+
124
+
Before v2.8, a ``BaseRestartWorkChain`` would always restart once for an unhandled failure.
125
+
126
+
There may be cases where a process experiences a failure that has no corresponding error handler, but you still want to restart the process.
127
+
A typical example here is a node failure, where you simply want to restart the process without any changes to the input.
128
+
By default, a ``BaseRestartWorkChain`` will abort when it encounters a failure it cannot handle, but this behavior can be changed through the ``on_unhandled_failure`` input.
129
+
The options are:
130
+
131
+
``abort`` (default)
132
+
The work chain immediately aborts with exit code ``402`` (``ERROR_UNHANDLED_FAILURE``).
133
+
This is the most conservative option and prevents wasting computational resources by rerunning calculations that will likely fail again with the same inputs.
134
+
135
+
``pause``
136
+
The work chain pauses for user inspection.
137
+
This allows you to examine the failed subprocess and decide whether to continue or abort.
138
+
When paused, you can:
139
+
140
+
- Use ``verdi process report <PK>`` to inspect the work chain's progress and error messages.
141
+
- Use ``verdi process play <PK>`` to resume the work chain (e.g., if the failure was due to a transient issue like a node failure).
142
+
- Use ``verdi process kill <PK>`` to abort the work chain if the problem cannot be resolved.
143
+
144
+
``restart_once``
145
+
The work chain will automatically restart the subprocess once.
146
+
If the subprocess fails again with another unhandled failure, the work chain aborts with ``ERROR_UNHANDLED_FAILURE``.
147
+
148
+
``restart_and_pause``
149
+
The work chain combines the previous two strategies: it restarts the subprocess once, and if that also results in an unhandled failure, it pauses for user inspection rather than aborting.
150
+
This provides a balance between automatic recovery and user control.
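The four options can be summarized as a small decision function. This is a minimal sketch of the behavior described above, not code from ``aiida-core``: the function name ``decide_on_unhandled_failure`` and the returned action strings are hypothetical, and what happens after a paused work chain is resumed is deliberately left out.

```python
def decide_on_unhandled_failure(option, consecutive_failures):
    """Return the action to take after an unhandled failure: 'abort', 'pause' or 'restart'.

    ``option`` is one of the four values of ``on_unhandled_failure``.
    ``consecutive_failures`` counts the unhandled failures in a row, including
    the current one, so it is 1 for the first unhandled failure.
    """
    if option == 'abort':
        return 'abort'
    if option == 'pause':
        return 'pause'
    if option == 'restart_once':
        # One automatic restart; a second unhandled failure aborts.
        return 'restart' if consecutive_failures == 1 else 'abort'
    if option == 'restart_and_pause':
        # One automatic restart; a second unhandled failure pauses instead.
        return 'restart' if consecutive_failures == 1 else 'pause'
    raise ValueError(f'unknown option: {option!r}')


for option in ('abort', 'pause', 'restart_once', 'restart_and_pause'):
    first = decide_on_unhandled_failure(option, 1)
    second = decide_on_unhandled_failure(option, 2)
    print(f'{option}: first failure -> {first}, second failure -> {second}')
```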
151
+
152
+
.. seealso::
153
+
154
+
You may be wondering if it's also possible to change the inputs and restart after inspecting the failure of a paused ``BaseRestartWorkChain``.
155
+
This is not yet possible, but we are exploring the option; see `the following blog post <https://aiida.net/news/posts/2025-11-21-human-in-the-loop.html>`_.
76
156
77
157
Writing a base restart work chain
78
158
=================================
@@ -120,7 +200,7 @@ Next, as with all work chains, we should *define* its process specification:
120
200
121
201
The inputs and outputs that we define are essentially determined by the subprocess that the work chain will be running.
122
202
Since the ``ArithmeticAddCalculation`` requires the inputs ``x`` and ``y``, and produces the ``sum`` as output, we `mirror` those in the specification of the work chain, otherwise we wouldn't be able to pass the necessary inputs.
123
-
Finally, we define the logical outline, which if you look closely, resembles the logical flow chart presented in :numref:`workflow-error-handling-flow-loop` a lot.
203
+
Finally, we define the logical outline.
124
204
We start by *setting up* the work chain and then enter a loop: *while* the subprocess has not yet finished successfully *and* we haven't exceeded the maximum number of iterations, we *run* another instance of the process and then *inspect* the results.
125
205
The while conditions are implemented in the ``should_run_process`` outline step.
126
206
When the process finishes successfully or we have to give up, we report the *results*.
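The control flow of this outline can be sketched in plain Python, without any AiiDA machinery. This is only an illustration of the loop: ``run_with_restarts`` and ``make_flaky_process`` are hypothetical names, an exit status of ``0`` is taken to mean success, and the error handling that the real work chain performs between iterations is left out.

```python
def run_with_restarts(process, max_iterations=5):
    """Run ``process`` until it succeeds or ``max_iterations`` is reached.

    ``process`` is any callable that returns an integer exit status, where 0
    means success. Returns the final exit status and the number of iterations.
    """
    iteration = 0
    exit_status = None

    # The while condition: not yet finished successfully and iterations left.
    while exit_status != 0 and iteration < max_iterations:
        iteration += 1
        exit_status = process()  # run the process, then inspect its exit status

    return exit_status, iteration


def make_flaky_process(failures_before_success):
    """Return a hypothetical process that fails a fixed number of times before it succeeds."""
    state = {'calls': 0}

    def process():
        state['calls'] += 1
        return 410 if state['calls'] <= failures_before_success else 0

    return process


print(run_with_restarts(make_flaky_process(2)))
print(run_with_restarts(make_flaky_process(10), max_iterations=3))
```

The first call succeeds on the third iteration; the second one stops after three iterations while the process is still failing, which is the situation in which the work chain would give up.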
@@ -190,7 +270,7 @@ As you can see the work chain launched a single instance of the ``ArithmeticAddC
190
270
Indeed, when updating an existing work chain file or adding a new one, it is **necessary** to restart the daemon **every time** after all changes have taken place.
191
271
192
272
Exposing inputs and outputs
193
-
===========================
273
+
---------------------------
194
274
195
275
Any base restart work chain *needs* to *expose* the inputs of the subprocess it wraps, and most likely *wants* to do the same for the outputs it produces, although the latter is not necessary.
196
276
For the simple example presented in the previous section, simply copy-pasting the input and output port definitions of the subprocess ``ArithmeticAddCalculation`` was not too troublesome.
@@ -253,7 +333,7 @@ When submitting or running the work chain using namespaced inputs (``add`` in th
253
333
254
334
255
335
Customizing outputs
256
-
===================
336
+
-------------------
257
337
258
338
By default, the ``BaseRestartWorkChain`` will attach the exposed outputs of the last completed calculation job.
259
339
In most cases this is the correct behavior, but there might be use-cases where one wants to modify exactly what outputs are attached to the work chain.
@@ -273,7 +353,7 @@ In this case, it is important to go through a ``calcfunction``, as always, as to
273
353
274
354
275
355
Attaching outputs
276
-
=================
356
+
-----------------
277
357
278
358
In a normal run, the ``results`` method is the last step in the outline of the ``BaseRestartWorkChain``.
279
359
In this step, the outputs of the last completed calculation job are "attached" to the work chain itself.
@@ -284,7 +364,7 @@ In this case the work chain will be stopped immediately and the ``results`` step
284
364
285
365
286
366
Error handling
287
-
==============
367
+
--------------
288
368
289
369
So far you have seen how easy it is to get a work chain up and running that will run a subprocess using the ``BaseRestartWorkChain``.
290
370
However, the whole point of this exercise, as described in the introduction, was for the work chain to be able to deal with *failing* processes, yet in the previous example it finished without any problems.
@@ -309,14 +389,11 @@ This time we will see that the work chain takes quite a different path:
As expected, the ``ArithmeticAddCalculation`` failed this time with a ``410``.
316
-
The work chain noticed the failure when inspecting the result of the subprocess in ``inspect_process``, and in keeping with its name and design, restarted the calculation.
317
-
However, since the inputs were not changed, the calculation inevitably and wholly expectedly failed once more with the exact same error code.
318
-
Unlike after the first iteration, however, the work chain did not restart again, but gave up and returned the exit code ``402`` itself, which stands for ``ERROR_SECOND_CONSECUTIVE_UNHANDLED_FAILURE``.
319
-
As the name suggests, the work chain tried to run the subprocess but it failed twice in a row without the problem being *handled*.
As expected, the ``ArithmeticAddCalculation`` failed with a ``410``.
395
+
The work chain noticed the failure when inspecting the result of the subprocess in ``inspect_process``, but since no process handler dealt with this error (we haven't written any yet), it is considered an *unhandled failure*.
396
+
By default, when encountering an unhandled failure, the work chain will abort immediately and return the exit code ``402`` (``ERROR_UNHANDLED_FAILURE``).
320
397
The obvious question now of course is: "How exactly can we instruct the base work chain to handle certain problems?"
321
398
322
399
Since the problems are necessarily dependent on the subprocess that the work chain will run, it cannot be implemented by the ``BaseRestartWorkChain`` class itself, but rather will have to be implemented by the subclass.
@@ -384,7 +461,7 @@ Instead of having a conditional at the start of each handler to compare the exit
384
461
If the ``exit_codes`` keyword is defined, which can be either a single instance of :class:`~aiida.engine.processes.exit_code.ExitCode` or a list thereof, the process handler will only be called if the exit status of the node corresponds to one of those exit codes, otherwise it will simply be skipped.
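The filtering rule for ``exit_codes`` can be sketched as follows. This is a simplified stand-in and not the ``aiida-core`` implementation: the ``ExitCode`` tuple defined here only mimics the real class, and ``handler_applies`` and the example exit codes are hypothetical.

```python
from collections import namedtuple

# Stand-in for the real ``ExitCode`` class; only the fields needed here.
ExitCode = namedtuple('ExitCode', ['status', 'message'])


def handler_applies(exit_status, exit_codes=None):
    """Return whether a handler should be called for a node with the given exit status.

    ``exit_codes`` may be ``None`` (no filtering), a single ``ExitCode`` or a
    list of ``ExitCode`` instances.
    """
    if exit_codes is None:
        return True
    if isinstance(exit_codes, ExitCode):
        exit_codes = [exit_codes]
    # The handler is skipped unless the exit status matches one of the codes.
    return exit_status in [exit_code.status for exit_code in exit_codes]


# Hypothetical exit codes for the example.
negative_sum = ExitCode(410, 'the sum is negative')
other_error = ExitCode(420, 'some other problem')

print(handler_applies(410))
print(handler_applies(410, negative_sum))
print(handler_applies(410, [other_error]))
```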
385
462
386
463
Multiple process handlers
387
-
=========================
464
+
-------------------------
388
465
389
466
Since typically a base restart work chain implementation will have more than one process handler, one might want to control the order in which they are called.
390
467
This can be done through the ``priority`` keyword:
@@ -433,25 +510,3 @@ The base restart work chain will detect this exit code and abort the work chain,
433
510
434
511
With these basic tools, a broad range of use-cases can be addressed while preventing a lot of boilerplate code.
435
512
436
-
437
-
Handler overrides
438
-
=================
439
-
440
-
It is possible to change the priority of handlers and enable/disable them without changing the source code of the work chain.
441
-
These properties of the handlers can be controlled through the ``handler_overrides`` input of the work chain.
442
-
This input takes a ``Dict`` node, that has the following form:
443
-
444
-
.. code-block:: python
445
-
446
-
handler_overrides = Dict({
447
-
'handler_negative_sum': {
448
-
'enabled': True,
449
-
'priority': 10000
450
-
}
451
-
})
452
-
453
-
As you can see, the keys are the name of the handler to affect and the value is a dictionary that can take two keys: ``enabled`` and ``priority``.
454
-
To enable or disable a handler, set ``enabled`` to ``True`` or ``False``, respectively.
455
-
The ``priority`` key takes an integer and determines the priority of the handler.
456
-
Note that the values of the ``handler_overrides`` are fully optional and will override the values configured by the process handler decorator in the source code of the work chain.
457
-
The changes also only affect the work chain instance that receives the ``handler_overrides`` input, all other instances of the work chain that will be launched will be unaffected.