add pollHealthChecker interface for optional RPC health checks #83

Open
Krish-vemula wants to merge 5 commits into main from cre/PLEX-2476

Krish-vemula commented Feb 17, 2026

Summary

Adds an optional PollHealthCheck method to the RPCClient interface, enabling chain-specific RPC clients to perform additional health checks during node pool polling. Failures from this check count toward the PollFailureThreshold, allowing automatic detection and failover from unhealthy RPC nodes.

Supports: #352

Add optional interface for chain-specific RPC clients to run extra health
checks during alive-loop polling. Failures count toward poll failure threshold.

Enables chain integrations to detect issues like missing historical state.
@Krish-vemula Krish-vemula marked this pull request as ready for review February 18, 2026 23:02
@Krish-vemula Krish-vemula requested a review from a team as a code owner February 18, 2026 23:02
…r finalized state availability with configurable threshold and regex-based error classification.
return c.MultiNode.FinalizedStateCheckEnabled != nil && *c.MultiNode.FinalizedStateCheckEnabled
}

func (c *MultiNodeConfig) FinalizedStateCheckAddress() string {
Contributor

If properly implemented, this should never panic because of the nil value. If we see a panic, it's an early signal that config overrides are not working as expected.
I agree that, in general, we should be cautious and check for nils, but in this case we should follow the common config structure to keep things consistent and spot issues early.
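A rough sketch of the pattern this comment relies on: defaults populate every pointer field and user overrides are layered on top, so an accessor can dereference without a nil guard. The type and helper names here (`multiNodeConfig`, `setFrom`, `defaults`, `resolve`, `ptr`) are hypothetical and only approximate the repository's config structure.

```go
package main

import "fmt"

// multiNodeConfig mirrors the pointer-field style of the snippet above
// (trimmed and hypothetical).
type multiNodeConfig struct {
	FinalizedStateCheckEnabled *bool
}

// setFrom copies only the fields that the override actually sets.
func (c *multiNodeConfig) setFrom(o *multiNodeConfig) {
	if o.FinalizedStateCheckEnabled != nil {
		c.FinalizedStateCheckEnabled = o.FinalizedStateCheckEnabled
	}
}

// ptr returns a pointer to a copy of v.
func ptr[T any](v T) *T { return &v }

// defaults populates every field, so accessors can dereference directly.
func defaults() multiNodeConfig {
	return multiNodeConfig{FinalizedStateCheckEnabled: ptr(false)}
}

// resolve layers user overrides on top of the defaults.
func resolve(override multiNodeConfig) multiNodeConfig {
	cfg := defaults()
	cfg.setFrom(&override)
	return cfg
}

// finalizedStateCheckEnabled dereferences without a nil guard. A panic here
// would mean the defaults were never applied.
func (c *multiNodeConfig) finalizedStateCheckEnabled() bool {
	return *c.FinalizedStateCheckEnabled
}

func main() {
	unset := resolve(multiNodeConfig{})
	enabled := resolve(multiNodeConfig{FinalizedStateCheckEnabled: ptr(true)})
	fmt.Println(unset.finalizedStateCheckEnabled(), enabled.finalizedStateCheckEnabled())
}
```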

lggr.Tracew("Pinging RPC", "nodeState", n.State(), "pollFailures", pollFailures)
pollCtx, cancel := context.WithTimeout(ctx, pollInterval)
version, pingErr := n.RPC().ClientVersion(pollCtx)
if pingErr == nil {
Contributor

This is redundant with the new logic, no?

finalizedStateFailures++
}
lggr.Warnw("Finalized state not available", "err", stateErr, "failures", finalizedStateFailures, "threshold", finalizedStateCheckFailureThreshold)
if finalizedStateCheckFailureThreshold > 0 && finalizedStateFailures >= finalizedStateCheckFailureThreshold {
Contributor

IMO finalizedStateCheckFailureThreshold > 0 is redundant, since we control the health check via finalizedStateCheckEnabled.

}
lggr.Warnw("Finalized state not available", "err", stateErr, "failures", finalizedStateFailures, "threshold", finalizedStateCheckFailureThreshold)
if finalizedStateCheckFailureThreshold > 0 && finalizedStateFailures >= finalizedStateCheckFailureThreshold {
lggr.Errorw("RPC node cannot serve finalized state after consecutive failures", "failures", finalizedStateFailures)
Contributor

Let's introduce a metric similar to PollsFailed to have better visibility into the failure rate.
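As a rough illustration of the shape such a counter could take, here is a standard-library-only sketch that uses `expvar` as a stand-in for whatever metrics library the project uses. The names `finalizedStateChecksFailed`, `recordFinalizedStateFailure`, and `failureCount`, and the chain/node key format, are all hypothetical.

```go
package main

import (
	"expvar"
	"fmt"
)

// finalizedStateChecksFailed counts failed finalized-state checks per node.
// It is an unpublished expvar.Map used purely as an in-memory counter set.
var finalizedStateChecksFailed = new(expvar.Map).Init()

// recordFinalizedStateFailure increments the counter for one node.
func recordFinalizedStateFailure(chainID, nodeName string) {
	finalizedStateChecksFailed.Add(chainID+"/"+nodeName, 1)
}

// failureCount reads the current value, returning 0 for unknown nodes.
func failureCount(chainID, nodeName string) int64 {
	v, ok := finalizedStateChecksFailed.Get(chainID + "/" + nodeName).(*expvar.Int)
	if !ok {
		return 0
	}
	return v.Value()
}

func main() {
	recordFinalizedStateFailure("1", "primary")
	recordFinalizedStateFailure("1", "primary")
	fmt.Println(failureCount("1", "primary"), failureCount("1", "secondary"))
}
```

Incrementing at the same point where the warning is logged would expose the failure rate even when the threshold is never reached.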

case <-time.After(dialRetryBackoff.Duration()):
lggr.Tracew("Trying to re-dial RPC node", "nodeState", n.getCachedState())

err := n.rpc.Dial(ctx)
Contributor

Use createVerifiedConn and wait for at least one successful poll of CheckFinalizedStateAvailability.

// isFinalizedStateUnavailableError checks if the error indicates that the RPC cannot serve
// historical state (as opposed to an RPC reachability issue).
// If regexPattern is empty, all errors are treated as state unavailable errors.
func isFinalizedStateUnavailableError(err error, regexPattern string) bool {
Contributor

Classification should be done in evm

return nodeStateUnreachable == node.State()
})
})
t.Run("optional poll health check failure counts as poll failure and transitions to unreachable", func(t *testing.T) {
Contributor

Add a test to verify that an RPC can be marked as nodeStateFinalizedStateNotAvailable and then marked alive again.

2 participants