faster tiered data-plane-controller deploys #2535

@jgraettinger

Problem

Today, data-plane-controller runs as a Cloud Run Job -- it runs for an hour and then gracefully exits when ready, and a crontab starts the next one. The job is given a good amount of memory, as Pulumi / Ansible can use a bunch, which limits our effective concurrency to 1. This is currently barely enough to keep up with periodic infra refreshes, and releases take a very long time unless we remember to manually spam the "Execute" button -- and even then, that only lasts for an hour. We'd like more elastic scaling so that data-plane config changes are applied as quickly as possible.

However, there's a related risk: if data-plane config changes go out very quickly, an error in the configuration also has maximum splash damage. We need to minimize the number of manual steps involved in a release and the chance for human error, and have a more automated waterfall to catch problems early.

Proposal

We will introduce a "tier" concept (an integer) to data-plane deployment configurations.

  • Tiers will range from 0 to 100, with 50 being the default if not specified

We will introduce a "tag" concept (a string) to data-plane deployment configurations

  • Tags are optional metadata describing the purpose of a deployment and default to empty.
  • Tags are folded into the generated deployment hash
    • ... Only if the tag is non-empty, to prevent churning deployment hashes (both concepts are sketched below)
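
A minimal sketch of how these two concepts could look in a deployment config, assuming serde-style deserialization; the field and function names are illustrative, not the actual data-plane-controller schema:

```rust
use serde::Deserialize;
use sha2::{Digest, Sha256};

// Illustrative deployment-config fields (not the actual
// data-plane-controller schema).
#[derive(Deserialize)]
struct DeploymentConfig {
    /// Tier in [0, 100]; defaults to 50 when unspecified.
    #[serde(default = "default_tier")]
    tier: u8,
    /// Optional tag describing the deployment's purpose; defaults to empty.
    #[serde(default)]
    tag: String,
    // ... other deployment fields elided.
}

fn default_tier() -> u8 {
    50
}

/// Fold the tag into the deployment hash only when it's non-empty,
/// so pre-existing untagged deployments keep their current hashes.
fn deployment_hash(config: &DeploymentConfig, base_inputs: &[u8]) -> String {
    let mut hasher = Sha256::new();
    hasher.update(base_inputs);
    if !config.tag.is_empty() {
        hasher.update(config.tag.as_bytes());
    }
    format!("{:x}", hasher.finalize())
}
```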

We will introduce a max_tier column to the data_plane_releases table

  • A matched table row applies to a deployment if and only if the row's max_tier is >= the deployment's tier (see the sketch below).
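
As a sketch of this matching rule, assuming a hypothetical Rust-side row shape for data_plane_releases:

```rust
/// Hypothetical row shape for data_plane_releases.
struct DataPlaneRelease {
    version: String,
    max_tier: u8,
    // ... other columns elided.
}

/// A release row applies to a deployment iff the row's max_tier
/// is >= the deployment's tier.
fn applicable_releases<'a>(
    releases: &'a [DataPlaneRelease],
    deployment_tier: u8,
) -> impl Iterator<Item = &'a DataPlaneRelease> + 'a {
    releases.iter().filter(move |r| r.max_tier >= deployment_tier)
}
```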

We will refactor data-plane-controller to be not only a DB automation, but also an HTTP server

  • We'll run both a Cloud Run Job (as we do today) and a Cloud Run Service.
  • The Job will continue to run the automations crate executor, but will dispatch an HTTP request to the Service rather than doing the work itself.
    • This makes jobs very lightweight, and we can run a single instance with effectively infinite job concurrency.
  • The Cloud Run Service will accept the request and do the actual work.
    • It'd be configured with generous RAM, a request concurrency of 1, and an effectively infinite timeout.
    • The Job client will manage the actual timeout, and cancel the request on reaching it (sketched below).
    • The Service would need to quickly wind down work, gracefully if possible
      • ... though anything better than a hard kill is a strict improvement compared to today
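
A rough sketch of the Job-side dispatch, assuming reqwest with its json feature enabled; the endpoint URL, payload shape, and timeout value are hypothetical:

```rust
use std::time::Duration;

/// Sketch of the Job-side dispatch. The Job owns the timeout: when it
/// elapses, the request future errors out and the in-flight connection
/// is dropped, which is the Service's cue to wind down its work.
async fn dispatch_task(
    client: &reqwest::Client,
    payload: &serde_json::Value,
) -> anyhow::Result<()> {
    let response = client
        .post("https://data-plane-controller.example.app/execute")
        .json(payload)
        // Per-request timeout, overriding any client-wide default.
        .timeout(Duration::from_secs(55 * 60))
        .send()
        .await?;
    response.error_for_status()?;
    Ok(())
}
```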

Discussion

This design solves a few problems for us:

  • It makes deployment scaling elastic
    • The Cloud Run Service will scale up and down as required to serve HTTP requests initiated by the Job.
  • It allows for single-instance canary deployments without manual steps
    • We'd continuously operate single-instance deployments in canary'd data-planes with a low tier.
    • Canaries would have a canary tag value, which influences their hash
      • ... which lets them live alongside the "main" high-tier deployment even if they're on the same version.
  • It lets us perform multi-stage rollouts where we deploy
    • First to private test planes
    • Then to production canaries
    • Then to an arbitrary-depth waterfall of tiered public and private deployments
    • Each stage of the rollout is achieved by ratcheting max_tier of a data_plane_releases row upwards, minimizing the chance for human error (a sketch of one ratchet step follows this list).
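
One ratchet step might look like the following sketch, assuming sqlx against Postgres; the table and column names follow the proposal, while the id column is hypothetical:

```rust
/// Sketch of one rollout step: ratchet a release row's max_tier upward,
/// never downward.
async fn ratchet_max_tier(
    pool: &sqlx::PgPool,
    release_id: i64,
    new_max_tier: i16,
) -> sqlx::Result<()> {
    sqlx::query(
        "UPDATE data_plane_releases
         SET max_tier = $2
         WHERE id = $1 AND max_tier < $2",
    )
    .bind(release_id)
    .bind(new_max_tier)
    .execute(pool)
    .await?;
    Ok(())
}
```

The max_tier < $2 guard makes each step monotonic and idempotent: re-running a stage is a no-op, and a stale or out-of-order update can never move a release back down the waterfall.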
