
Commit 341182d

Merge pull request #2433 from redis/DOC-5992-failover-restruct
DOC-5992 failover restruct
2 parents f5e81c6 + 9d533cc commit 341182d

3 files changed: +241 -184 lines

Lines changed: 103 additions & 0 deletions
@@ -0,0 +1,103 @@
---
categories:
- docs
- develop
- stack
- oss
- rs
- rc
- kubernetes
- clients
description: Improve reliability using the failover/failback features of client libraries.
linkTitle: Failover/failback
title: Failover and failback
topics:
- failover
- failback
- resilience
- health checks
relatedPages:
- /develop/clients/jedis/failover
- /develop/clients/redis-py/failover
scope: overview
weight: 50
---

Some Redis client libraries support
[failover and failback](https://en.wikipedia.org/wiki/Failover)
to improve the availability of connections to Redis databases. Use this page
to get a general overview of the concepts and then see the documentation for
your client library to learn how to configure it for failover and failback:

- [Jedis]({{< relref "/develop/clients/jedis/failover" >}})
- [redis-py]({{< relref "/develop/clients/redis-py/failover" >}}) (preview)

## Concepts

You may have several [Active-Active databases]({{< relref "/operate/rs/databases/active-active" >}})
or independent Redis servers that are all suitable to serve your app.
Typically, you would prefer to use some database endpoints over others for a particular
instance of your app (perhaps the ones that are closest geographically to the app server
to reduce network latency). However, if the best endpoint is not available due
to a failure, it is generally better to switch to another, suboptimal endpoint
than to let the app fail completely.

*Failover* is the technique of actively checking for connection failures or
unacceptably slow connections and automatically switching to the best available endpoint
when they occur. This requires you to specify a list of endpoints to try, ordered by priority. The diagram below shows this process:

{{< image filename="images/failover/failover-client-reconnect.svg" alt="Failover and client reconnection" >}}

The complementary technique of *failback* then involves periodically checking the health
of all endpoints that have failed. If any endpoints recover, the failback mechanism
automatically switches the connection to the one with the highest priority.
This could potentially be repeated until the optimal endpoint is available again.

{{< image filename="images/failover/failover-client-failback.svg" alt="Failback: client switches back to original server" width="75%" >}}

### Detecting connection problems

Redis clients use a [circuit breaker design pattern](https://en.wikipedia.org/wiki/Circuit_breaker_design_pattern) to detect connection problems.

The circuit breaker is a software component that tracks the sequence of recent
Redis connection attempts and commands, recording which ones have succeeded and
which have failed.
(Note that many command failures are caused by transient errors such as timeouts,
so before recording a failure, the first response should usually be just to retry
the command a few times.)

The status of the attempted command calls is kept in a "sliding window", which
is simply a buffer where the least recent item is dropped as each new
one is added. The failure threshold for the window can be configured as a fixed number of failures, a failure ratio (specified as a percentage), or both, evaluated over a time window.

{{< image filename="images/failover/failover-sliding-window.svg" alt="Sliding window of recent connection attempts" >}}

When the number of failures in the window exceeds the configured
threshold, the circuit breaker declares the server to be unhealthy and triggers
a failover.

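As a rough illustration (not the implementation any particular client library uses), a count-based sliding window that triggers failover once the failure ratio crosses a threshold might look like this:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative only: records the outcome of recent calls and reports when
// the failure ratio over the window crosses the configured threshold.
class SlidingWindowBreaker {
    private final Deque<Boolean> window = new ArrayDeque<>(); // true = failure
    private final int windowSize;
    private final double failureRateThreshold; // e.g. 0.5 for 50%

    SlidingWindowBreaker(int windowSize, double failureRateThreshold) {
        this.windowSize = windowSize;
        this.failureRateThreshold = failureRateThreshold;
    }

    void record(boolean failed) {
        window.addLast(failed);
        if (window.size() > windowSize) {
            window.removeFirst(); // drop the least recent result
        }
    }

    boolean shouldFailOver() {
        if (window.size() < windowSize) {
            return false; // not enough samples yet
        }
        long failures = window.stream().filter(f -> f).count();
        return (double) failures / window.size() >= failureRateThreshold;
    }
}
```
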
### Selecting a failover target

Since you may have multiple Redis servers available to fail over to, the client
lets you configure a list of endpoints to try, ordered by priority or
"weight". When a failover is triggered, the client selects the highest-weighted
endpoint that is still healthy and uses it for the temporary connection.

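Conceptually, the selection amounts to something like the sketch below (the types and method here are illustrative, not any client's actual API):

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Illustrative only: choose the healthy endpoint with the highest weight.
record WeightedEndpoint(String host, int port, float weight, boolean healthy) {}

class FailoverTargetSelector {
    static Optional<WeightedEndpoint> selectTarget(List<WeightedEndpoint> endpoints) {
        return endpoints.stream()
                .filter(WeightedEndpoint::healthy)
                .max(Comparator.comparingDouble(WeightedEndpoint::weight));
    }
}
```
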
### Health checks

Given that the original endpoint had some geographical or other advantage
over the failover target, you will generally want to fail back to it as soon
as it recovers. In the meantime, another server might recover that is
still better than the current failover target, so it might be worth
failing back to that server even if it is not optimal.

Clients periodically run a "health check" on each server to see if it has recovered.
The health check can be as simple as sending a Redis
[`PING`]({{< relref "/commands/ping" >}}) or
[`ECHO`]({{< relref "/commands/echo" >}}) command and ensuring that it gives the
expected response.

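For example, a minimal standalone probe using Jedis might look like the sketch below (the client libraries run equivalent checks for you, so you don't normally write this yourself):

```java
import redis.clients.jedis.Jedis;

// Illustrative probe: the endpoint is considered healthy if PING returns "PONG".
class PingProbe {
    static boolean isHealthy(String host, int port) {
        try (Jedis jedis = new Jedis(host, port)) {
            return "PONG".equalsIgnoreCase(jedis.ping());
        } catch (Exception e) {
            return false; // any error counts as an unhealthy endpoint
        }
    }
}
```
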
You can also configure the client to run health checks on the current target
server during periods of inactivity, even if no failover has occurred. This can
help to detect problems even if your app is not actively using the server.

content/develop/clients/jedis/failover.md

Lines changed: 78 additions & 79 deletions
@@ -12,85 +12,29 @@ categories:
description: Improve reliability using the failover/failback features of Jedis.
linkTitle: Failover/failback
title: Failover and failback
+topics:
+- failover
+- failback
+- resilience
+- health checks
+- retries
+relatedPages:
+- /develop/clients/failover
+scope: [client-specific, implementation]
weight: 50
---

Jedis supports [failover and failback](https://en.wikipedia.org/wiki/Failover)
to improve the availability of connections to Redis databases. This page explains
-the concepts and describes how to configure Jedis for failover and failback.
+how to configure Jedis for failover and failback. For an overview of the concepts,
+see the main [Failover/failback]({{< relref "/develop/clients/failover" >}}) page.

-## Concepts
-
-You may have several [Active-Active databases]({{< relref "/operate/rs/databases/active-active" >}})
-or independent Redis servers that are all suitable to serve your app.
-Typically, you would prefer to use some database endpoints over others for a particular
-instance of your app (perhaps the ones that are closest geographically to the app server
-to reduce network latency). However, if the best endpoint is not available due
-to a failure, it is generally better to switch to another, suboptimal endpoint
-than to let the app fail completely.
-
-*Failover* is the technique of actively checking for connection failures or
-unacceptably slow connections and automatically switching to the best available endpoint
-when they occur. This requires you to specify a list of endpoints to try, ordered by priority. The diagram below shows this process:
-
-{{< image filename="images/failover/failover-client-reconnect.svg" alt="Failover and client reconnection" >}}
-
-The complementary technique of *failback* then involves periodically checking the health
-of all endpoints that have failed. If any endpoints recover, the failback mechanism
-automatically switches the connection to the one with the highest priority.
-This could potentially be repeated until the optimal endpoint is available again.
-
-{{< image filename="images/failover/failover-client-failback.svg" alt="Failback: client switches back to original server" width="75%" >}}
-
-### Detecting connection problems
+## Failover configuration

Jedis uses the [resilience4j](https://resilience4j.readme.io/docs/getting-started)
library to detect connection problems using a
[circuit breaker design pattern](https://en.wikipedia.org/wiki/Circuit_breaker_design_pattern).

-The circuit breaker is a software component that tracks the sequence of recent
-Redis connection attempts and commands, recording which ones have succeeded and
-which have failed.
-(Note that many command failures are caused by transient errors such as timeouts,
-so before recording a failure, the first response should usually be just to retry
-the command a few times.)
-
-The status of the attempted command calls is kept in a "sliding window", which
-is simply a buffer where the least recent item is dropped as each new
-one is added. The buffer can be configured to have a fixed number of failures and/or a failure ratio (specified as a percentage), both based on a time window.
-
-{{< image filename="images/failover/failover-sliding-window.svg" alt="Sliding window of recent connection attempts" >}}
-
-When the number of failures in the window exceeds a configured
-threshold, the circuit breaker declares the server to be unhealthy and triggers
-a failover.
-
-### Selecting a failover target
-
-Since you may have multiple Redis servers available to fail over to, Jedis
-lets you configure a list of endpoints to try, ordered by priority or
-"weight". When a failover is triggered, Jedis selects the highest-weighted
-endpoint that is still healthy and uses it for the temporary connection.
-
-### Health checks
-
-Given that the original endpoint had some geographical or other advantage
-over the failover target, you will generally want to fail back to it as soon
-as it recovers. In the meantime, another server might recover that is
-still better than the current failover target, so it might be worth
-failing back to that server even if it is not optimal.
-
-Jedis periodically runs a "health check" on each server to see if it has recovered.
-The health check can be as simple as
-sending a Redis [`PING`]({{< relref "/commands/ping" >}}) command and ensuring
-that it gives the expected response.
-
-You can also configure Jedis to run health checks on the current target
-server during periods of inactivity, even if no failover has occurred. This can
-help to detect problems even if your app is not actively using the server.
-
-## Failover configuration
-
The example below shows a simple case with a list of two servers,
`redis-east` and `redis-west`, where `redis-east` is the preferred
target. If `redis-east` fails, Jedis should fail over to
@@ -150,7 +94,9 @@ poolConfig.setTestWhileIdle(true);
poolConfig.setTimeBetweenEvictionRuns(Duration.ofSeconds(1));
```

-Supply the weighted list of endpoints using the `MultiDbConfig` builder.
+Supply the weighted list of endpoints using the `MultiDbConfig` builder
+(see [Selecting a failover target]({{< relref "/develop/clients/failover#selecting-a-failover-target" >}}) for a full description of how
+the weighted list is used).
Use the `weight` option to order the endpoints, with the highest
weight being tried first.

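As a minimal sketch of what the weighting looks like (reusing only the builder calls shown on this page; `clientConfig` and the host names are placeholders), two endpoints where `redis-east` is preferred could be declared like this:

```java
// Sketch: redis-east is preferred (higher weight); redis-west is the fallback.
MultiDbConfig.DatabaseConfig east =
    MultiDbConfig.DatabaseConfig.builder(new HostAndPort("redis-east.example.com", 14000), clientConfig)
        .weight(1.0f)
        .build();

MultiDbConfig.DatabaseConfig west =
    MultiDbConfig.DatabaseConfig.builder(new HostAndPort("redis-west.example.com", 14000), clientConfig)
        .weight(0.5f)
        .build();
```
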
@@ -203,7 +149,8 @@ but will also handle the connection management and failover transparently.
### Circuit breaker configuration

The `MultiDbConfig.CircuitBreakerConfig` builder lets you pass several options to configure
-the circuit breaker:
+the circuit breaker (see [Detecting connection problems]({{< relref "/develop/clients/failover#detecting-connection-problems" >}}) for more information on how the
+circuit breaker works):

| Builder method | Default value | Description|
| --- | --- | --- |
@@ -275,9 +222,27 @@ MultiDbClient client = MultiDbClient.builder()

## Health check configuration

-There are several strategies available for health checks that you can configure using the
-`MultiDbConfig` builder. The sections below explain these strategies
-in more detail.
+Each health check consists of one or more separate "probes", each of which is a simple
+test (such as a [`PING`]({{< relref "/commands/ping" >}}) command) to determine if the database is available. The results of the separate probes are combined
+using a configurable policy to determine if the database is healthy.
+
+There are several strategies available for health checks that you can deploy using the
+`MultiDbConfig` builder. Each strategy is a class that implements the `HealthCheckStrategy`
+interface. Use the constructor of a `HealthCheckStrategy` implementation to pass
+a `HealthCheckStrategy.Config` object to configure the health check behavior.
+The methods of the base `HealthCheckStrategy.Config` builder are shown below.
+Note that some strategies (including your own custom strategies) may use a
+subclass of `HealthCheckStrategy.Config` to provide extra options.
+
+| Builder method | Default value | Description|
+| --- | --- | --- |
+| `interval()` | `1000` | Interval in milliseconds between health checks. |
+| `timeout()` | `1000` | Timeout in milliseconds for health check requests. |
+| `numProbes()` | `3` | Number of probes to perform during each health check. |
+| `delayInBetweenProbes()` | `100` | Delay in milliseconds between probes during a health check. |
+| `policy()` | `ProbingPolicy.BuiltIn.ALL_SUCCESS` | Policy to determine if the database is healthy based on the probe results. The options are `ALL_SUCCESS` (all probes must succeed), `ANY_SUCCESS` (at least one probe must succeed), and `MAJORITY_SUCCESS` (majority of probes must succeed). |
+
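As a brief sketch of using the probing policy (this assumes the base `Config` options above are exposed by `PingStrategy.Config.builder()`, as in the example further down the page), a check that passes when most probes succeed might be configured like this:

```java
// Sketch: five probes per check, healthy if a majority of them succeed.
// Assumes the base HealthCheckStrategy.Config options are available on this builder.
PingStrategy.Config cfg = PingStrategy.Config.builder()
    .numProbes(5)
    .delayInBetweenProbes(200)
    .policy(ProbingPolicy.BuiltIn.MAJORITY_SUCCESS)
    .build();
```
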
+The sections below explain the available strategies in more detail.

### `PingStrategy` (default)

@@ -287,6 +252,23 @@ and checks that it gives the expected response. Any unexpected response
or exception indicates an unhealthy server. Although `PingStrategy` is
very simple, it is a good basic approach for most Redis deployments.

+Although `PingStrategy` is the default, you can still activate it
+explicitly using the `healthCheckStrategy()` method of the `MultiDbConfig.DatabaseConfig`
+builder. Use this approach if you want to configure the default
+`PingStrategy` with custom options, as shown in the example below.
+
+```java
+MultiDbConfig.DatabaseConfig dbConfig =
+    MultiDbConfig.DatabaseConfig.builder(hostAndPort, clientConfig)
+        .healthCheckStrategy(new PingStrategy(PingStrategy.Config.builder()
+            .interval(5000)             // Check every 5 seconds
+            .timeout(3000)              // 3 second timeout
+            .numProbes(5)               // 5 probes per check
+            .delayInBetweenProbes(100)  // 100ms delay between probes
+            .build()))
+        .build();
+```
+
### `LagAwareStrategy` (preview)

`LagAwareStrategy` (currently in preview) is designed specifically for
@@ -320,13 +302,12 @@ MultiDbConfig.DatabaseConfig dbConfig =
.build();
```

-The `LagAwareStrategy.Config` builder has the following options:
+The `LagAwareStrategy.Config` builder has the following options in
+addition to the standard options provided by `HealthCheckStrategy.Config`:

| Builder method | Default value | Description|
| --- | --- | --- |
| `sslOptions()` | `null` | Standard SSL options for connecting to the REST API. |
-| `interval()` | `5000` | Interval in milliseconds between health checks. |
-| `timeout()` | `3000` | Timeout in milliseconds for health check requests. |
| `extendedCheckEnabled()` | `false` | Enable extended lag checking (this includes lag validation in addition to the standard datapath validation). |
| `availabilityLagTolerance()` | `100` | Maximum lag tolerance in milliseconds for extended lag checking. |

@@ -366,16 +347,34 @@ MultiDbConfig.DatabaseConfig dbConfig = MultiDbConfig.DatabaseConfig.builder(eas
.build();
```

-## Manual failback
+## Managing databases at runtime
+
+Although you will typically configure all databases during the initial connection, you can also modify the configuration at runtime. The example below shows how to add and remove database endpoints.
+
+```java
+HostAndPort other = new HostAndPort("redis-south.example.com", 14000);
+
+// Create the database config as you would for the initial connection.
+client.addDatabase(DatabaseConfig.builder(other, config)
+    // ...
+    .weight(0.5f)
+    .build()
+);
+
+// Remove the database from the failover set.
+client.removeDatabase(other);
+```
+
+### Manual failback

By default, the failback mechanism runs health checks on all servers in the
weighted list and selects the highest-weighted server that is
healthy. However, you can also use the `setActiveDatabase()` method of
`MultiDbClient` to select which database to use manually:

```java
-// The `setActiveDatabase()` method receives the `HostAndPort` of the
-// cluster to switch to.
+// The `setActiveDatabase()` method receives the `Endpoint` (e.g., `HostAndPort`)
+// of the cluster to switch to.
client.setActiveDatabase(west);
```

0 commit comments
