add pollHealthChecker interface for optional RPC health checks #83
Krish-vemula wants to merge 5 commits into main from
Conversation
Add optional interface for chain-specific RPC clients to run extra health checks during alive-loop polling. Failures count toward poll failure threshold. Enables chain integrations to detect issues like missing historical state.
…r finalized state availability with configurable threshold and regex-based error classification.
multinode/config/config.go
Outdated
	return c.MultiNode.FinalizedStateCheckEnabled != nil && *c.MultiNode.FinalizedStateCheckEnabled
}

func (c *MultiNodeConfig) FinalizedStateCheckAddress() string {
If properly implemented, this should never panic because of the nil value. If we see a panic, it's an early signal that config overrides are not working as expected.
I agree that, in general, we should be cautious and check for nils, but in this case we should follow the common config structure to keep things consistent and spot issues early.
lggr.Tracew("Pinging RPC", "nodeState", n.State(), "pollFailures", pollFailures)
pollCtx, cancel := context.WithTimeout(ctx, pollInterval)
version, pingErr := n.RPC().ClientVersion(pollCtx)
if pingErr == nil {
This is redundant with the new logic, no?
multinode/node_lifecycle.go
Outdated
	finalizedStateFailures++
}
lggr.Warnw("Finalized state not available", "err", stateErr, "failures", finalizedStateFailures, "threshold", finalizedStateCheckFailureThreshold)
if finalizedStateCheckFailureThreshold > 0 && finalizedStateFailures >= finalizedStateCheckFailureThreshold {
IMO the `finalizedStateCheckFailureThreshold > 0` check is redundant, since we already control the health check via `finalizedStateCheckEnabled`.
multinode/node_lifecycle.go
Outdated
}
lggr.Warnw("Finalized state not available", "err", stateErr, "failures", finalizedStateFailures, "threshold", finalizedStateCheckFailureThreshold)
if finalizedStateCheckFailureThreshold > 0 && finalizedStateFailures >= finalizedStateCheckFailureThreshold {
	lggr.Errorw("RPC node cannot serve finalized state after consecutive failures", "failures", finalizedStateFailures)
Let's introduce a metric similar to PollsFailed to have better visibility into the failure rate.
multinode/node_lifecycle.go
Outdated
case <-time.After(dialRetryBackoff.Duration()):
	lggr.Tracew("Trying to re-dial RPC node", "nodeState", n.getCachedState())

	err := n.rpc.Dial(ctx)
Use createVerifiedConn and wait for at least one successful poll of CheckFinalizedStateAvailability.
multinode/node_lifecycle.go
Outdated
// isFinalizedStateUnavailableError checks if the error indicates that the RPC cannot serve
// historical state (as opposed to an RPC reachability issue).
// If regexPattern is empty, all errors are treated as state unavailable errors.
func isFinalizedStateUnavailableError(err error, regexPattern string) bool {
Classification should be done in evm
	return nodeStateUnreachable == node.State()
})
})
t.Run("optional poll health check failure counts as poll failure and transitions to unreachable", func(t *testing.T) {
Add a test to verify that an RPC can be marked as nodeStateFinalizedStateNotAvailable and then marked alive again.
Summary
Adds an optional PollHealthCheck method to the RPCClient interface, enabling chain-specific RPC clients to perform additional health checks during node pool polling. Failures from this check count toward the PollFailureThreshold, allowing automatic detection of and failover from unhealthy RPC nodes.

Supports: #352