
A proposal to make bbstorage highly available #230

@lijunsong

I am proposing a way to make sharded storage highly available. This proposal has been deployed in an environment with tens of shards, and the availability of bbstorage has increased as expected.

Context

We've seen two types of production reliability issues in the buildbarn design:

  1. buildbarn's storage is a single point of failure even when sharded; all storage shards must be online to serve traffic.
  2. storage sharding and replication use client-side routing, but the gRPC configuration doesn't support changing the LB policy. For simple deployments, the lack of an LB policy is not an issue, but in a more customized architecture, the gRPC throughput from frontends/workers to storage is usually throttled, causing slow downloads.

Note: issue 2 isn't directly related to HA, but as we try out different architectures, the lack of an LB policy makes the architecture I am proposing here less useful.

HA sharded storage

Buildbarn currently doesn't have a way to handle backend failures -- when one backend fails, the action fails, which triggers a Bazel retry. The retry makes a build look like it's "hanging" until the backend reconnects to the frontend. In other words, the availability of buildbarn depends on every single member of the shards.

Here, I am proposing a way to make sharded storage highly available: introduce a health-checked sharding strategy, where a shard is automatically disabled (or, to use the blobstore term, "drained") when it's considered unavailable. The shard selection algorithm will then automatically and stably pick the next available shard.

(Note that the goal here is not zero downtime, but to let the system recover automatically from a short period of downtime.)

For those who need a bit more context on the current sharding mechanism: the key to understanding how sharding is done is https://github.com/buildbarn/bb-storage/blob/master/pkg/blobstore/sharding/sharding_blob_access.go#L35-L56, where:

  1. a blob digest is used to generate a sequence of random integers (each integer is random, but the sequence is stable for a given digest);
  2. each integer is used to index into a list of cumulative shard weights (https://github.com/buildbarn/bb-storage/blob/master/pkg/blobstore/sharding/weighted_shard_permuter.go);
  3. a shard is selected by testing whether the shard at that index is nil. If it is nil, the shard is marked drained, so selection continues with the next random integer.
    Because the random integer sequence is stable, when some shards are "drained", the next available shard will always be picked stably. (See the sketch after this list.)
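
A minimal sketch of this selection loop, assuming uniform shard weights (the real implementation uses the weighted permuter linked above) and a hypothetical Shard type; this is an illustration, not bb-storage's actual code:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

// Shard is a hypothetical stand-in for a bb-storage backend; a nil
// entry in the shard list means the shard is drained.
type Shard struct{ Name string }

// pickShard derives a stable pseudo-random index sequence from the
// digest and returns the first non-drained shard (steps 1-3 above).
// It assumes at least one shard is still available.
func pickShard(digest string, shards []*Shard) *Shard {
	h := fnv.New64a()
	h.Write([]byte(digest))
	r := rand.New(rand.NewSource(int64(h.Sum64()))) // stable sequence per digest
	for {
		i := r.Intn(len(shards))
		if s := shards[i]; s != nil {
			return s
		}
		// Shard i is drained: continue with the next integer in the
		// stable sequence, so every client lands on the same fallback.
	}
}

func main() {
	shards := []*Shard{{"s0"}, {"s1"}, {"s2"}}
	fmt.Println(pickShard("blob-abc", shards).Name)
	shards[1] = nil // drain s1; its digests move stably to another shard
	fmt.Println(pickShard("blob-abc", shards).Name)
}
```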

The new sharding mechanism

The new mechanism changes the last step: it performs a periodic health check (every few seconds) and marks unavailable shards as "drained", effectively making them nil in the shard list. Shard selection then picks the next available shards (temporary shards).

When the unavailable shards come back, the shard selection algorithm will land on the original shard numbers again and stop sending traffic to the temporary ones.
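
A minimal sketch of this health-check loop, reusing the hypothetical Shard type from the sketch above; the probe function and all names here are assumptions, not the proposed blob access API:

```go
package sharding

import (
	"context"
	"sync"
	"time"
)

// Shard as in the previous sketch; nil in the live list means drained.
type Shard struct{ Name string }

// healthChecker periodically probes each backend and flips its slot in
// the live shard list between nil (drained) and its original value.
type healthChecker struct {
	mu    sync.Mutex
	live  []*Shard                                  // what shard selection reads
	all   []*Shard                                  // full, original shard list
	probe func(ctx context.Context, s *Shard) error // e.g. a gRPC health check (assumed)
}

// run drains shards whose probe fails and restores them once the probe
// succeeds again, so selection stops using temporary shards on its own.
func (hc *healthChecker) run(ctx context.Context, interval time.Duration) {
	t := time.NewTicker(interval) // single-digit seconds, per the proposal
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			for i, s := range hc.all {
				err := hc.probe(ctx, s)
				hc.mu.Lock()
				if err != nil {
					hc.live[i] = nil // drain: selection skips this slot
				} else {
					hc.live[i] = s // restore once the shard is back
				}
				hc.mu.Unlock()
			}
		}
	}
}
```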

The new mechanism is simple but effective. It's not without drawbacks, though.

Issue: reaching consensus among routing clients

The sharding algorithm runs in the clients (frontends, workers, etc.). When we have hundreds of such clients, how do we reach consensus among them about which shards are gone?

It could be done with consensus algorithms at the cost of additional complexity, but in the simplest form, we don't try to reach consensus in our production setup. The goal we want to achieve is not zero downtime, but automatic recovery from a short period of downtime. A few builds will fail, but newly started builds will always succeed (with some cache misses).
In production, what we observe is that:

  1. When any storage shard goes offline, gRPC notifies clients almost immediately, so within a few seconds all clients are effectively reading from/writing to the new shard. Cache misses do happen in this case.
  2. When a storage shard comes back online, gRPC's exponential connection-retry backoff can leave clients with an inconsistent view of shard availability. When a precondition fails, Bazel sometimes retries and sometimes fails. IMO, the eventually consistent view is still better than the stuck builds we had before.

Architectural Improvement to help reduce inconsistency when storage goes offline

Instead of the traditional deployment where every frontend and worker connects to every shard, the setup below can simplify ops (adding/draining/removing shards):

  1. Put the shards behind a group of frontends (the internal frontend).
  2. Let the internal frontend do the routing work.
  3. Workers and customer-facing frontends connect to the internal frontend.
              ┌─────────────► shard1        
              │                             
      worker  ├──────────────►shard2        
              │                 .           
              └──────────────►shardN        

becomes

                    ┌─────────────► shard1        
                    │                             
  workers──►frontend├──────────────►shard2        
                    │                 .           
                    └──────────────►shardN  

This deployment brings many operational benefits that I won't go into in detail. The architecture allows a small number of frontends to handle a large amount of traffic via load balancing, and the smaller the frontend group, the quicker consensus can form.

To use the new architecture, the gRPC connections from workers to frontends must use L7 load balancing (instead of, say, k8s's L4 load balancing).
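
For illustration, this is the standard way to enable client-side round-robin (L7) in grpc-go; the target address is hypothetical, and feature 1 in the list below would expose the equivalent knob through buildbarn's configuration:

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// The dns resolver returns every address behind the name, and the
	// round_robin policy spreads RPCs across all of them, instead of
	// pinning a single TCP connection the way an L4 balancer does.
	conn, err := grpc.Dial(
		"dns:///internal-frontend.example.svc.cluster.local:8980", // hypothetical target
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}
```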

Upcoming changes to enable HA storage

This HA storage has been running internally for a while, and it works for small/medium-scale buildbarn deployments.

If this design is interesting to others, I'd like to send out pull requests to upstream these features:

  1. Enable the gRPC round-robin policy via configuration.
  2. Create a new blob access that checks the availability of each storage shard.
  3. Add a health_check field to the sharding configuration; when health checking is enabled, use the new blob access (see the sketch after this list).
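
A sketch of how features 2 and 3 could fit together; every name here is hypothetical (the real configuration is protobuf-based and the real interface is blobstore.BlobAccess), and the constructors are stubbed out:

```go
package sharding

import "time"

// BlobAccess is a hypothetical stand-in for blobstore.BlobAccess.
type BlobAccess interface{}

// HealthCheckConfiguration mirrors the proposed health_check field.
type HealthCheckConfiguration struct {
	Interval time.Duration // probe period, single-digit seconds
}

// ShardingConfiguration is an illustrative shape, not the real message.
type ShardingConfiguration struct {
	Shards      []BlobAccess
	HealthCheck *HealthCheckConfiguration // absent == today's behavior
}

// NewShardingBlobAccess wires the pieces: plain sharding when the new
// health_check field is absent, the availability-checking wrapper
// (feature 2) when it is present.
func NewShardingBlobAccess(cfg ShardingConfiguration) BlobAccess {
	base := newPlainShardingBlobAccess(cfg.Shards) // existing behavior
	if cfg.HealthCheck == nil {
		return base
	}
	return newHealthCheckingBlobAccess(base, cfg.HealthCheck.Interval)
}

// Stubs so the sketch compiles; the real constructors would live in
// bb-storage's blobstore configuration code.
func newPlainShardingBlobAccess(shards []BlobAccess) BlobAccess { return struct{}{} }

func newHealthCheckingBlobAccess(base BlobAccess, interval time.Duration) BlobAccess {
	return base
}
```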
