
Commit 341182d

Merge pull request #2433 from redis/DOC-5992-failover-restruct
DOC-5992 failover restruct
2 parents f5e81c6 + 9d533cc commit 341182d

3 files changed: +241 -184 lines

Lines changed: 103 additions & 0 deletions
@@ -0,0 +1,103 @@
---
categories:
- docs
- develop
- stack
- oss
- rs
- rc
- kubernetes
- clients
description: Improve reliability using the failover/failback features of client libraries.
linkTitle: Failover/failback
title: Failover and failback
topics:
- failover
- failback
- resilience
- health checks
relatedPages:
- /develop/clients/jedis/failover
- /develop/clients/redis-py/failover
scope: overview
weight: 50
---

Some Redis client libraries support
[failover and failback](https://en.wikipedia.org/wiki/Failover)
to improve the availability of connections to Redis databases. Use this page
to get a general overview of the concepts and then see the documentation for
your client library to learn how to configure it for failover and failback:

- [Jedis]({{< relref "/develop/clients/jedis/failover" >}})
- [redis-py]({{< relref "/develop/clients/redis-py/failover" >}}) (preview)

## Concepts

You may have several [Active-Active databases]({{< relref "/operate/rs/databases/active-active" >}})
or independent Redis servers that are all suitable to serve your app.
Typically, you would prefer to use some database endpoints over others for a particular
instance of your app (perhaps the ones that are closest geographically to the app server
to reduce network latency). However, if the best endpoint is not available due
to a failure, it is generally better to switch to another, suboptimal endpoint
than to let the app fail completely.

*Failover* is the technique of actively checking for connection failures or
unacceptably slow connections and automatically switching to the best available endpoint
when they occur. This requires you to specify a list of endpoints to try, ordered by priority. The diagram below shows this process:

{{< image filename="images/failover/failover-client-reconnect.svg" alt="Failover and client reconnection" >}}

The complementary technique of *failback* then involves periodically checking the health
of all endpoints that have failed. If any endpoints recover, the failback mechanism
automatically switches the connection to the one with the highest priority.
This could potentially be repeated until the optimal endpoint is available again.

{{< image filename="images/failover/failover-client-failback.svg" alt="Failback: client switches back to original server" width="75%" >}}

### Detecting connection problems

Redis clients use a [circuit breaker design pattern](https://en.wikipedia.org/wiki/Circuit_breaker_design_pattern) to detect connection problems.

The circuit breaker is a software component that tracks the sequence of recent
Redis connection attempts and commands, recording which ones have succeeded and
which have failed.
(Note that many command failures are caused by transient errors such as timeouts,
so before recording a failure, the first response should usually be just to retry
the command a few times.)

The status of the attempted command calls is kept in a "sliding window", which
is simply a buffer where the least recent item is dropped as each new
one is added. The failure threshold for the window can be configured as a fixed number of failures, a failure ratio (specified as a percentage), or both, evaluated over a time window.

{{< image filename="images/failover/failover-sliding-window.svg" alt="Sliding window of recent connection attempts" >}}

When the number of failures in the window exceeds the configured
threshold, the circuit breaker declares the server to be unhealthy and triggers
a failover.

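As a rough illustration (not the implementation any particular client library uses), a count-based sliding window that triggers failover once the failure ratio crosses a threshold might look like this:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative only: records the outcome of recent calls and reports when
// the failure ratio over the window crosses the configured threshold.
class SlidingWindowBreaker {
    private final Deque<Boolean> window = new ArrayDeque<>(); // true = failure
    private final int windowSize;
    private final double failureRateThreshold; // e.g. 0.5 for 50%

    SlidingWindowBreaker(int windowSize, double failureRateThreshold) {
        this.windowSize = windowSize;
        this.failureRateThreshold = failureRateThreshold;
    }

    void record(boolean failed) {
        window.addLast(failed);
        if (window.size() > windowSize) {
            window.removeFirst(); // drop the least recent result
        }
    }

    boolean shouldFailOver() {
        if (window.size() < windowSize) {
            return false; // not enough samples yet
        }
        long failures = window.stream().filter(f -> f).count();
        return (double) failures / window.size() >= failureRateThreshold;
    }
}
```
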
### Selecting a failover target

Since you may have multiple Redis servers available to fail over to, the client
lets you configure a list of endpoints to try, ordered by priority or
"weight". When a failover is triggered, the client selects the highest-weighted
endpoint that is still healthy and uses it for the temporary connection.

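Conceptually, the selection amounts to something like the sketch below (the types and method here are illustrative, not any client's actual API):

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Illustrative only: choose the healthy endpoint with the highest weight.
record WeightedEndpoint(String host, int port, float weight, boolean healthy) {}

class FailoverTargetSelector {
    static Optional<WeightedEndpoint> selectTarget(List<WeightedEndpoint> endpoints) {
        return endpoints.stream()
                .filter(WeightedEndpoint::healthy)
                .max(Comparator.comparingDouble(WeightedEndpoint::weight));
    }
}
```
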
### Health checks

Given that the original endpoint had some geographical or other advantage
over the failover target, you will generally want to fail back to it as soon
as it recovers. In the meantime, another server might recover that is
still better than the current failover target, so it might be worth
failing back to that server even if it is not optimal.

Clients periodically run a "health check" on each server to see if it has recovered.
The health check can be as simple as sending a Redis
[`PING`]({{< relref "/commands/ping" >}}) or
[`ECHO`]({{< relref "/commands/echo" >}}) command and ensuring that it gives the
expected response.

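For example, a minimal standalone probe using Jedis might look like the sketch below (the client libraries run equivalent checks for you, so you don't normally write this yourself):

```java
import redis.clients.jedis.Jedis;

// Illustrative probe: the endpoint is considered healthy if PING returns "PONG".
class PingProbe {
    static boolean isHealthy(String host, int port) {
        try (Jedis jedis = new Jedis(host, port)) {
            return "PONG".equalsIgnoreCase(jedis.ping());
        } catch (Exception e) {
            return false; // any error counts as an unhealthy endpoint
        }
    }
}
```
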
You can also configure the client to run health checks on the current target
server during periods of inactivity, even if no failover has occurred. This can
help to detect problems even if your app is not actively using the server.

content/develop/clients/jedis/failover.md

Lines changed: 78 additions & 79 deletions
@@ -12,85 +12,29 @@ categories:
description: Improve reliability using the failover/failback features of Jedis.
linkTitle: Failover/failback
title: Failover and failback
+topics:
+- failover
+- failback
+- resilience
+- health checks
+- retries
+relatedPages:
+- /develop/clients/failover
+scope: [client-specific, implementation]
weight: 50
---

Jedis supports [failover and failback](https://en.wikipedia.org/wiki/Failover)
to improve the availability of connections to Redis databases. This page explains
-the concepts and describes how to configure Jedis for failover and failback.
+how to configure Jedis for failover and failback. For an overview of the concepts,
+see the main [Failover/failback]({{< relref "/develop/clients/failover" >}}) page.

-## Concepts
-
-You may have several [Active-Active databases]({{< relref "/operate/rs/databases/active-active" >}})
-or independent Redis servers that are all suitable to serve your app.
-Typically, you would prefer to use some database endpoints over others for a particular
-instance of your app (perhaps the ones that are closest geographically to the app server
-to reduce network latency). However, if the best endpoint is not available due
-to a failure, it is generally better to switch to another, suboptimal endpoint
-than to let the app fail completely.
-
-*Failover* is the technique of actively checking for connection failures or
-unacceptably slow connections and automatically switching to the best available endpoint
-when they occur. This requires you to specify a list of endpoints to try, ordered by priority. The diagram below shows this process:
-
-{{< image filename="images/failover/failover-client-reconnect.svg" alt="Failover and client reconnection" >}}
-
-The complementary technique of *failback* then involves periodically checking the health
-of all endpoints that have failed. If any endpoints recover, the failback mechanism
-automatically switches the connection to the one with the highest priority.
-This could potentially be repeated until the optimal endpoint is available again.
-
-{{< image filename="images/failover/failover-client-failback.svg" alt="Failback: client switches back to original server" width="75%" >}}
-
-### Detecting connection problems
+## Failover configuration

Jedis uses the [resilience4j](https://resilience4j.readme.io/docs/getting-started)
library to detect connection problems using a
[circuit breaker design pattern](https://en.wikipedia.org/wiki/Circuit_breaker_design_pattern).

-The circuit breaker is a software component that tracks the sequence of recent
-Redis connection attempts and commands, recording which ones have succeeded and
-which have failed.
-(Note that many command failures are caused by transient errors such as timeouts,
-so before recording a failure, the first response should usually be just to retry
-the command a few times.)
-
-The status of the attempted command calls is kept in a "sliding window", which
-is simply a buffer where the least recent item is dropped as each new
-one is added. The buffer can be configured to have a fixed number of failures and/or a failure ratio (specified as a percentage), both based on a time window.
-
-{{< image filename="images/failover/failover-sliding-window.svg" alt="Sliding window of recent connection attempts" >}}
-
-When the number of failures in the window exceeds a configured
-threshold, the circuit breaker declares the server to be unhealthy and triggers
-a failover.
-
-### Selecting a failover target
-
-Since you may have multiple Redis servers available to fail over to, Jedis
-lets you configure a list of endpoints to try, ordered by priority or
-"weight". When a failover is triggered, Jedis selects the highest-weighted
-endpoint that is still healthy and uses it for the temporary connection.
-
-### Health checks
-
-Given that the original endpoint had some geographical or other advantage
-over the failover target, you will generally want to fail back to it as soon
-as it recovers. In the meantime, another server might recover that is
-still better than the current failover target, so it might be worth
-failing back to that server even if it is not optimal.
-
-Jedis periodically runs a "health check" on each server to see if it has recovered.
-The health check can be as simple as
-sending a Redis [`PING`]({{< relref "/commands/ping" >}}) command and ensuring
-that it gives the expected response.
-
-You can also configure Jedis to run health checks on the current target
-server during periods of inactivity, even if no failover has occurred. This can
-help to detect problems even if your app is not actively using the server.
-
-## Failover configuration
-
The example below shows a simple case with a list of two servers,
`redis-east` and `redis-west`, where `redis-east` is the preferred
target. If `redis-east` fails, Jedis should fail over to
@@ -150,7 +94,9 @@ poolConfig.setTestWhileIdle(true);
poolConfig.setTimeBetweenEvictionRuns(Duration.ofSeconds(1));
```

-Supply the weighted list of endpoints using the `MultiDbConfig` builder.
+Supply the weighted list of endpoints using the `MultiDbConfig` builder
+(see [Selecting a failover target]({{< relref "/develop/clients/failover#selecting-a-failover-target" >}}) for a full description of how
+the weighted list is used).
Use the `weight` option to order the endpoints, with the highest
weight being tried first.

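As a minimal sketch of what the weighting looks like (reusing only the builder calls shown on this page; `clientConfig` and the host names are placeholders), two endpoints where `redis-east` is preferred could be declared like this:

```java
// Sketch: redis-east is preferred (higher weight); redis-west is the fallback.
MultiDbConfig.DatabaseConfig east =
    MultiDbConfig.DatabaseConfig.builder(new HostAndPort("redis-east.example.com", 14000), clientConfig)
        .weight(1.0f)
        .build();

MultiDbConfig.DatabaseConfig west =
    MultiDbConfig.DatabaseConfig.builder(new HostAndPort("redis-west.example.com", 14000), clientConfig)
        .weight(0.5f)
        .build();
```
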
@@ -203,7 +149,8 @@ but will also handle the connection management and failover transparently.
### Circuit breaker configuration

The `MultiDbConfig.CircuitBreakerConfig` builder lets you pass several options to configure
-the circuit breaker:
+the circuit breaker (see [Detecting connection problems]({{< relref "/develop/clients/failover#detecting-connection-problems" >}}) for more information on how the
+circuit breaker works):

| Builder method | Default value | Description|
| --- | --- | --- |
@@ -275,9 +222,27 @@ MultiDbClient client = MultiDbClient.builder()

## Health check configuration

-There are several strategies available for health checks that you can configure using the
-`MultiDbConfig` builder. The sections below explain these strategies
-in more detail.
+Each health check consists of one or more separate "probes", each of which is a simple
+test (such as a [`PING`]({{< relref "/commands/ping" >}}) command) to determine if the database is available. The results of the separate probes are combined
+using a configurable policy to determine if the database is healthy.
+
+There are several strategies available for health checks that you can deploy using the
+`MultiDbConfig` builder. Each strategy is a class that implements the `HealthCheckStrategy`
+interface. Use the constructor of a `HealthCheckStrategy` implementation to pass
+a `HealthCheckStrategy.Config` object to configure the health check behavior.
+The methods of the base `HealthCheckStrategy.Config` builder are shown below.
+Note that some strategies (including your own custom strategies) may use a
+subclass of `HealthCheckStrategy.Config` to provide extra options.
+
+| Builder method | Default value | Description|
+| --- | --- | --- |
+| `interval()` | `1000` | Interval in milliseconds between health checks. |
+| `timeout()` | `1000` | Timeout in milliseconds for health check requests. |
+| `numProbes()` | `3` | Number of probes to perform during each health check. |
+| `delayInBetweenProbes()` | `100` | Delay in milliseconds between probes during a health check. |
+| `policy()` | `ProbingPolicy.BuiltIn.ALL_SUCCESS` | Policy to determine if the database is healthy based on the probe results. The options are `ALL_SUCCESS` (all probes must succeed), `ANY_SUCCESS` (at least one probe must succeed), and `MAJORITY_SUCCESS` (majority of probes must succeed). |
+
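As a brief sketch of using the probing policy (this assumes the base `Config` options above are exposed by `PingStrategy.Config.builder()`, as in the example further down the page), a check that passes when most probes succeed might be configured like this:

```java
// Sketch: five probes per check, healthy if a majority of them succeed.
// Assumes the base HealthCheckStrategy.Config options are available on this builder.
PingStrategy.Config cfg = PingStrategy.Config.builder()
    .numProbes(5)
    .delayInBetweenProbes(200)
    .policy(ProbingPolicy.BuiltIn.MAJORITY_SUCCESS)
    .build();
```
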
+The sections below explain the available strategies in more detail.

### `PingStrategy` (default)

@@ -287,6 +252,23 @@ and checks that it gives the expected response. Any unexpected response
or exception indicates an unhealthy server. Although `PingStrategy` is
very simple, it is a good basic approach for most Redis deployments.

+Although `PingStrategy` is the default, you can still activate it
+explicitly using the `healthCheckStrategy()` method of the `MultiDbConfig.DatabaseConfig`
+builder. Use this approach if you want to configure the default
+`PingStrategy` with custom options, as shown in the example below.
+
+```java
+MultiDbConfig.DatabaseConfig dbConfig =
+    MultiDbConfig.DatabaseConfig.builder(hostAndPort, clientConfig)
+        .healthCheckStrategy(new PingStrategy(PingStrategy.Config.builder()
+            .interval(5000)             // Check every 5 seconds
+            .timeout(3000)              // 3 second timeout
+            .numProbes(5)               // 5 probes per check
+            .delayInBetweenProbes(100)  // 100ms delay between probes
+            .build()))
+        .build();
+```
+
### `LagAwareStrategy` (preview)

`LagAwareStrategy` (currently in preview) is designed specifically for
@@ -320,13 +302,12 @@ MultiDbConfig.DatabaseConfig dbConfig =
.build();
```

-The `LagAwareStrategy.Config` builder has the following options:
+The `LagAwareStrategy.Config` builder has the following options in
+addition to the standard options provided by `HealthCheckStrategy.Config`:

| Builder method | Default value | Description|
| --- | --- | --- |
| `sslOptions()` | `null` | Standard SSL options for connecting to the REST API. |
-| `interval()` | `5000` | Interval in milliseconds between health checks. |
-| `timeout()` | `3000` | Timeout in milliseconds for health check requests. |
| `extendedCheckEnabled()` | `false` | Enable extended lag checking (this includes lag validation in addition to the standard datapath validation). |
| `availabilityLagTolerance()` | `100` | Maximum lag tolerance in milliseconds for extended lag checking. |

@@ -366,16 +347,34 @@ MultiDbConfig.DatabaseConfig dbConfig = MultiDbConfig.DatabaseConfig.builder(eas
.build();
```

-## Manual failback
+## Managing databases at runtime
+
+Although you will typically configure all databases during the initial connection, you can also modify the configuration at runtime. The example below shows how to add and remove database endpoints.
+
+```java
+HostAndPort other = new HostAndPort("redis-south.example.com", 14000);
+
+// Create the database config as you would for the initial connection.
+client.addDatabase(DatabaseConfig.builder(other, config)
+    // ...
+    .weight(0.5f)
+    .build()
+);
+
+// Remove the database from the failover set.
+client.removeDatabase(other);
+```
+
+### Manual failback

By default, the failback mechanism runs health checks on all servers in the
weighted list and selects the highest-weighted server that is
healthy. However, you can also use the `setActiveDatabase()` method of
`MultiDbClient` to select which database to use manually:

```java
-// The `setActiveDatabase()` method receives the `HostAndPort` of the
-// cluster to switch to.
+// The `setActiveDatabase()` method receives the `Endpoint` (e.g., `HostAndPort`)
+// of the cluster to switch to.
client.setActiveDatabase(west);
```

0 commit comments
