This repository was archived by the owner on May 18, 2021. It is now read-only.

Supervision strategies and Error Recovery of Graph Actors #109


Description

@jan-g

There are some external conditions that may lead to a graph actor panicking: persistence DB connectivity problems in particular, which are transient (but which we still want to know about).

The supervisor should be able to schedule actor rematerialisations in the future (perhaps this is better handled by a helper actor, if a plugin for this isn't already available). It should receive "try waking this up again" messages for graph actors, delayed with a jittered exponential backoff up to a maximum limit; a rough sketch of the backoff calculation follows below. We also need some notion of a "stability period" after which we consider an actor to have been successfully restarted (and, presumably, reset its backoff). Behaviour once the maximum backoff is reached should probably be to retry indefinitely (TBD).
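As a concrete illustration, here is a minimal Go sketch of the jittered, capped backoff described above. The `backoffFor` helper and the `baseDelay`/`maxDelay` values are illustrative assumptions, not existing code; resetting the attempt count after the stability period isn't shown.

```go
package supervise

import (
	"math/rand"
	"time"
)

const (
	baseDelay = 500 * time.Millisecond // assumed initial retry delay
	maxDelay  = 5 * time.Minute        // assumed cap on the backoff
)

// backoffFor returns the delay before the nth rematerialisation attempt:
// exponential growth, capped at maxDelay, with full jitter applied.
func backoffFor(attempt int) time.Duration {
	d := maxDelay
	if attempt < 20 { // beyond ~20 doublings we are well past the cap anyway
		d = baseDelay << uint(attempt) // 2^attempt growth
		if d > maxDelay {
			d = maxDelay
		}
	}
	// Full jitter: pick uniformly in [0, d) so restarting actors don't
	// all wake up in lockstep after a shared outage.
	return time.Duration(rand.Int63n(int64(d)))
}
```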

We should expose, via Prometheus, counters for any actors that we are rematerialising; those numbers going up may be a signal that some operator oversight is needed.
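For the metrics, something along the lines of the following sketch (using the standard Prometheus Go client) would do; the metric name, label set, and package are assumptions for illustration, not the project's existing metrics.

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Counter of graph-actor rematerialisations, labelled by the reason the
// actor panicked (e.g. "persistence_error"); label values are illustrative.
var graphRematerialisations = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "flow_graph_rematerialisations_total", // assumed metric name
		Help: "Number of times a graph actor has been rematerialised after a panic.",
	},
	[]string{"reason"},
)

func init() {
	prometheus.MustRegister(graphRematerialisations)
}

// RecordRematerialisation is bumped by the supervisor whenever it restarts
// a graph actor, so operators can alert on a rising rate.
func RecordRematerialisation(reason string) {
	graphRematerialisations.WithLabelValues(reason).Inc()
}
```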

Finally, there's a gotcha in the current architecture if we do this. Currently, a panicked actor will fail any stages that were under execution in fn, on the assumption that we're recovering after a whole-system restart. For a transiently panicking graph, however, there may be an extant executor goroutine out there (holding the PID of the panicked actor?) which still has a connection open to fn. TBD: should we (a) handle failing those fired stages differently, or (b) arrange things so that an executor goroutine can route its message to the new graph actor? A rough sketch of one way to do (b) follows.
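One possible shape for option (b), sketched without reference to the actual actor library: executors look up the live actor for a graph ID on every send instead of caching a PID, and the supervisor rebinds the entry whenever it rematerialises the actor. The `Registry` and `GraphRef` names here are hypothetical.

```go
package routing

import "sync"

// GraphRef abstracts "something that can receive a message for a graph";
// in practice this would wrap the actor PID.
type GraphRef interface {
	Tell(msg interface{})
}

// Registry maps graph IDs to the currently live actor reference. The
// supervisor updates it whenever it rematerialises a graph actor.
type Registry struct {
	mu   sync.RWMutex
	refs map[string]GraphRef
}

func NewRegistry() *Registry {
	return &Registry{refs: make(map[string]GraphRef)}
}

// Rebind is called by the supervisor after a restart, so late stage
// completions reach the new incarnation rather than the dead PID.
func (r *Registry) Rebind(graphID string, ref GraphRef) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.refs[graphID] = ref
}

// Deliver routes a message to whichever actor currently owns the graph;
// it returns false if none is live, letting the executor retry or fail.
func (r *Registry) Deliver(graphID string, msg interface{}) bool {
	r.mu.RLock()
	ref, ok := r.refs[graphID]
	r.mu.RUnlock()
	if !ok {
		return false
	}
	ref.Tell(msg)
	return true
}
```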
