This repository was archived by the owner on May 18, 2021. It is now read-only.

Description
There are some external considerations that may lead to a graph actor panicking: persistence DB connectivity problems in particular, which are transient (but we still want to know about them).
The supervisor should be able to schedule actor rematerialisations in the future (perhaps this is better off handled by a helper actor, if a plugin for this isn't already available). It should receive "try waking this up again" messages for graph actors with some jittery exponential backoff (up to a maximum limit). We need some notion of "stability period" after which we consider an actor to have been successfully restarted. Behaviour at the maximum backoff should be to retry indefinitely(? maybe - tbd).
We should expose counters for any actors that we are rematerialising via prometheus - those numbers going up may be a signal that we need some operator oversight.
Finally, there's a gotcha in the current architecture if we do this. Currently a panicked actor will fail any stages that were under execution in fn; we assume we're recovering on system restart. For a transiently panicking graph, there may be an extant executor goroutine out there (with the PID of the panicked actor?) which still holds open a connection to fn. TBD: Should we (a) deal with failing those fired stages differently? (b) arrange it so that an executor goroutine can route a message to the new graph actor?