This repository was archived by the owner on May 18, 2021. It is now read-only.

Supervision strategies and Error Recovery of Graph Actors #109


Description

@jan-g

There are some external conditions that may lead to a graph actor panicking: persistence DB connectivity problems in particular, which are transient (but which we still want to know about).

The supervisor should be able to schedule actor rematerialisations in the future (perhaps this is better handled by a helper actor, if a plugin for this isn't already available). It should receive "try waking this up again" messages for graph actors, delayed with a jittered exponential backoff up to a maximum limit; a rough sketch of the backoff calculation follows below. We also need some notion of a "stability period" after which we consider an actor to have been successfully restarted (and, presumably, reset its backoff). Behaviour once the maximum backoff is reached should probably be to retry indefinitely (TBD).
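As a concrete illustration, here is a minimal Go sketch of the jittered, capped backoff described above. The `backoffFor` helper and the `baseDelay`/`maxDelay` values are illustrative assumptions, not existing code; resetting the attempt count after the stability period isn't shown.

```go
package supervise

import (
	"math/rand"
	"time"
)

const (
	baseDelay = 500 * time.Millisecond // assumed initial retry delay
	maxDelay  = 5 * time.Minute        // assumed cap on the backoff
)

// backoffFor returns the delay before the nth rematerialisation attempt:
// exponential growth, capped at maxDelay, with full jitter applied.
func backoffFor(attempt int) time.Duration {
	d := maxDelay
	if attempt < 20 { // beyond ~20 doublings we are well past the cap anyway
		d = baseDelay << uint(attempt) // 2^attempt growth
		if d > maxDelay {
			d = maxDelay
		}
	}
	// Full jitter: pick uniformly in [0, d) so restarting actors don't
	// all wake up in lockstep after a shared outage.
	return time.Duration(rand.Int63n(int64(d)))
}
```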

We should expose, via Prometheus, counters for any actors that we are rematerialising; those numbers going up may be a signal that some operator oversight is needed.
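For the metrics, something along the lines of the following sketch (using the standard Prometheus Go client) would do; the metric name, label set, and package are assumptions for illustration, not the project's existing metrics.

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Counter of graph-actor rematerialisations, labelled by the reason the
// actor panicked (e.g. "persistence_error"); label values are illustrative.
var graphRematerialisations = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "flow_graph_rematerialisations_total", // assumed metric name
		Help: "Number of times a graph actor has been rematerialised after a panic.",
	},
	[]string{"reason"},
)

func init() {
	prometheus.MustRegister(graphRematerialisations)
}

// RecordRematerialisation is bumped by the supervisor whenever it restarts
// a graph actor, so operators can alert on a rising rate.
func RecordRematerialisation(reason string) {
	graphRematerialisations.WithLabelValues(reason).Inc()
}
```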

Finally, there's a gotcha in the current architecture if we do this. Currently, a panicked actor will fail any stages that were under execution in fn, on the assumption that we're recovering after a whole-system restart. For a transiently panicking graph, however, there may be an extant executor goroutine out there (holding the PID of the panicked actor?) which still has a connection open to fn. TBD: should we (a) handle failing those fired stages differently, or (b) arrange things so that an executor goroutine can route its message to the new graph actor? A rough sketch of one way to do (b) follows.
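One possible shape for option (b), sketched without reference to the actual actor library: executors look up the live actor for a graph ID on every send instead of caching a PID, and the supervisor rebinds the entry whenever it rematerialises the actor. The `Registry` and `GraphRef` names here are hypothetical.

```go
package routing

import "sync"

// GraphRef abstracts "something that can receive a message for a graph";
// in practice this would wrap the actor PID.
type GraphRef interface {
	Tell(msg interface{})
}

// Registry maps graph IDs to the currently live actor reference. The
// supervisor updates it whenever it rematerialises a graph actor.
type Registry struct {
	mu   sync.RWMutex
	refs map[string]GraphRef
}

func NewRegistry() *Registry {
	return &Registry{refs: make(map[string]GraphRef)}
}

// Rebind is called by the supervisor after a restart, so late stage
// completions reach the new incarnation rather than the dead PID.
func (r *Registry) Rebind(graphID string, ref GraphRef) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.refs[graphID] = ref
}

// Deliver routes a message to whichever actor currently owns the graph;
// it returns false if none is live, letting the executor retry or fail.
func (r *Registry) Deliver(graphID string, msg interface{}) bool {
	r.mu.RLock()
	ref, ok := r.refs[graphID]
	r.mu.RUnlock()
	if !ok {
		return false
	}
	ref.Tell(msg)
	return true
}
```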
