
A proposal to make bbstorage highly available #230

@lijunsong

I am proposing a way to make sharded storage highly available. This proposal has been deployed in an environment with tens of shards, and the availability of bbstorage has increased as expected.

Context

We've seen two types of production reliability issues in the buildbarn design:

  1. buildbarn's storage is a single point of failure even when sharded; all storage shards must be online to serve traffic.
  2. storage sharding and replication use client-side routing, but the gRPC configuration doesn't support changing the LB policy. For simple deployments, the lack of an LB policy is not an issue, but in a more customized architecture, the gRPC throughput from frontends/workers to storage is usually throttled, causing slow downloads.

Note: issue 2 isn't directly related to HA, but as we try out different architectures, the lack of an LB policy makes the architecture I am proposing here less useful.

HA sharded storage

Buildbarn currently doesn't have a way to handle backend failures -- when one backend fails, the action fails, which triggers a Bazel retry. The retry makes a build look like it's "hanging" until the backend reconnects to the frontend. In other words, the availability of buildbarn depends on every single member of the shards.

Here, I am proposing a way to make sharded storage highly available: introduce a health-checked sharding strategy, where a shard is automatically disabled (or, to use the blobstore term, "drained") when it's considered unavailable. The shard selection algorithm will then automatically and stably pick the next available shard.

(Note that the goal here is not zero downtime, but to let the system recover automatically from a short period of downtime.)

For those who need a bit more context on the current sharding mechanism: the key to understanding how sharding is done is https://github.com/buildbarn/bb-storage/blob/master/pkg/blobstore/sharding/sharding_blob_access.go#L35-L56, where:

  1. a blob digest is used to generate a sequence of random integers (each integer is random, but the sequence is stable for a given digest);
  2. each integer is used to index into a list of cumulative shard weights (https://github.com/buildbarn/bb-storage/blob/master/pkg/blobstore/sharding/weighted_shard_permuter.go);
  3. a shard is selected by testing whether the shard at that index is nil. If it is nil, the shard is marked drained, so selection continues with the next random integer.
    Because the random integer sequence is stable, when some shards are "drained", the next available shard will always be picked stably. (See the sketch after this list.)
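
A minimal sketch of this selection loop, assuming uniform shard weights (the real implementation uses the weighted permuter linked above) and a hypothetical Shard type; this is an illustration, not bb-storage's actual code:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

// Shard is a hypothetical stand-in for a bb-storage backend; a nil
// entry in the shard list means the shard is drained.
type Shard struct{ Name string }

// pickShard derives a stable pseudo-random index sequence from the
// digest and returns the first non-drained shard (steps 1-3 above).
// It assumes at least one shard is still available.
func pickShard(digest string, shards []*Shard) *Shard {
	h := fnv.New64a()
	h.Write([]byte(digest))
	r := rand.New(rand.NewSource(int64(h.Sum64()))) // stable sequence per digest
	for {
		i := r.Intn(len(shards))
		if s := shards[i]; s != nil {
			return s
		}
		// Shard i is drained: continue with the next integer in the
		// stable sequence, so every client lands on the same fallback.
	}
}

func main() {
	shards := []*Shard{{"s0"}, {"s1"}, {"s2"}}
	fmt.Println(pickShard("blob-abc", shards).Name)
	shards[1] = nil // drain s1; its digests move stably to another shard
	fmt.Println(pickShard("blob-abc", shards).Name)
}
```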

The new sharding mechanism

The new mechanism changes the last step: it performs a periodic health check (every few seconds) and marks unavailable shards as "drained", effectively making them nil in the shard list. Shard selection then picks the next available shards (temporary shards).

When the unavailable shards come back, the shard selection algorithm will land on the original shard numbers again and stop sending traffic to the temporary ones.
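
A minimal sketch of this health-check loop, reusing the hypothetical Shard type from the sketch above; the probe function and all names here are assumptions, not the proposed blob access API:

```go
package sharding

import (
	"context"
	"sync"
	"time"
)

// Shard as in the previous sketch; nil in the live list means drained.
type Shard struct{ Name string }

// healthChecker periodically probes each backend and flips its slot in
// the live shard list between nil (drained) and its original value.
type healthChecker struct {
	mu    sync.Mutex
	live  []*Shard                                  // what shard selection reads
	all   []*Shard                                  // full, original shard list
	probe func(ctx context.Context, s *Shard) error // e.g. a gRPC health check (assumed)
}

// run drains shards whose probe fails and restores them once the probe
// succeeds again, so selection stops using temporary shards on its own.
func (hc *healthChecker) run(ctx context.Context, interval time.Duration) {
	t := time.NewTicker(interval) // single-digit seconds, per the proposal
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			for i, s := range hc.all {
				err := hc.probe(ctx, s)
				hc.mu.Lock()
				if err != nil {
					hc.live[i] = nil // drain: selection skips this slot
				} else {
					hc.live[i] = s // restore once the shard is back
				}
				hc.mu.Unlock()
			}
		}
	}
}
```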

The new mechanism is simple but effective. It's not without drawbacks, though.

Issue: reaching consensus among routing clients

The sharding algorithm runs in the clients (frontends, workers, etc.). When we have hundreds of such clients, how do we reach consensus among them about which shards are gone?

It could be done with consensus algorithms at the cost of additional complexity, but in the simplest form, we don't try to reach consensus in our production setup. The goal we want to achieve is not zero downtime, but automatic recovery from a short period of downtime. A few builds will fail, but newly started builds will always succeed (with some cache misses).
In production, what we observe is that:

  1. When any storage shard goes offline, gRPC notifies clients almost immediately, so within a few seconds all clients are effectively reading from/writing to the new shard. Cache misses do happen in this case.
  2. When a storage shard comes back online, gRPC's exponential connection-retry backoff can leave clients with an inconsistent view of shard availability. When a precondition fails, Bazel sometimes retries and sometimes fails. IMO, the eventually consistent view is still better than the stuck builds we had before.

Architectural Improvement to help reduce inconsistency when storage goes offline

Instead of the traditional deployment where every frontend and worker connects to every shard, the setup below can simplify ops (adding/draining/removing shards):

  1. Put the shards behind a group of frontends (the internal frontend).
  2. Let the internal frontend do the routing work.
  3. Workers and customer-facing frontends connect to the internal frontend.
              ┌─────────────► shard1        
              │                             
      worker  ├──────────────►shard2        
              │                 .           
              └──────────────►shardN        

becomes

                    ┌─────────────► shard1        
                    │                             
  workers──►frontend├──────────────►shard2        
                    │                 .           
                    └──────────────►shardN  

This deployment brings many operational benefits that I won't go into in detail. The architecture allows a small number of frontends to handle a large amount of traffic via load balancing, and the smaller the frontend group, the quicker consensus can form.

To use the new architecture, the gRPC connections from workers to frontends must use L7 load balancing (instead of, say, k8s's L4 load balancing).
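
For illustration, this is the standard way to enable client-side round-robin (L7) in grpc-go; the target address is hypothetical, and feature 1 in the list below would expose the equivalent knob through buildbarn's configuration:

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// The dns resolver returns every address behind the name, and the
	// round_robin policy spreads RPCs across all of them, instead of
	// pinning a single TCP connection the way an L4 balancer does.
	conn, err := grpc.Dial(
		"dns:///internal-frontend.example.svc.cluster.local:8980", // hypothetical target
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}
```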

Upcoming changes to enable HA storage

This HA storage has been running internally for a while, and it works for small/medium-scale buildbarn deployments.

If this design is interesting to others, I'd like to send out pull requests to upstream these features:

  1. Enable the gRPC round-robin policy via configuration.
  2. Create a new blob access that checks the availability of each storage shard.
  3. Add a health_check field to the sharding configuration; when health checking is enabled, use the new blob access (see the sketch after this list).
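
A sketch of how features 2 and 3 could fit together; every name here is hypothetical (the real configuration is protobuf-based and the real interface is blobstore.BlobAccess), and the constructors are stubbed out:

```go
package sharding

import "time"

// BlobAccess is a hypothetical stand-in for blobstore.BlobAccess.
type BlobAccess interface{}

// HealthCheckConfiguration mirrors the proposed health_check field.
type HealthCheckConfiguration struct {
	Interval time.Duration // probe period, single-digit seconds
}

// ShardingConfiguration is an illustrative shape, not the real message.
type ShardingConfiguration struct {
	Shards      []BlobAccess
	HealthCheck *HealthCheckConfiguration // absent == today's behavior
}

// NewShardingBlobAccess wires the pieces: plain sharding when the new
// health_check field is absent, the availability-checking wrapper
// (feature 2) when it is present.
func NewShardingBlobAccess(cfg ShardingConfiguration) BlobAccess {
	base := newPlainShardingBlobAccess(cfg.Shards) // existing behavior
	if cfg.HealthCheck == nil {
		return base
	}
	return newHealthCheckingBlobAccess(base, cfg.HealthCheck.Interval)
}

// Stubs so the sketch compiles; the real constructors would live in
// bb-storage's blobstore configuration code.
func newPlainShardingBlobAccess(shards []BlobAccess) BlobAccess { return struct{}{} }

func newHealthCheckingBlobAccess(base BlobAccess, interval time.Duration) BlobAccess {
	return base
}
```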
