faster tiered data-plane-controller deploys #2535

@jgraettinger

Problem

Today, data-plane-controller runs as a Cloud Run Job -- it runs for an hour and then gracefully exits when ready, and a crontab starts the next one. The job is given a good amount of memory, as Pulumi / Ansible can use a bunch, which limits our effective concurrency to 1. This is currently barely enough to keep up with periodic infra refreshes, and releases take a very long time unless we remember to manually spam the "Execute" button -- and even then, that only lasts for an hour. We'd like more elastic scaling so that data-plane config changes are applied as quickly as possible.

However, there's a related risk: if data-plane config changes go out very quickly, an error in the configuration also has maximum splash damage. We need to minimize the number of manual steps involved in a release and the chance for human error, and have a more automated waterfall to catch problems early.

Proposal

We will introduce a "tier" concept (an integer) to data-plane deployment configurations.

  • Tiers will range from 0 to 100, with 50 being the default if not specified

We will introduce a "tag" concept (a string) to data-plane deployment configurations

  • Tags are optional metadata describing the purpose of a deployment and default to empty.
  • Tags are folded into the generated deployment hash
    • ... Only if the tag is non-empty, to prevent churning deployment hashes (both concepts are sketched below)
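
A minimal sketch of how these two concepts could look in a deployment config, assuming serde-style deserialization; the field and function names are illustrative, not the actual data-plane-controller schema:

```rust
use serde::Deserialize;
use sha2::{Digest, Sha256};

// Illustrative deployment-config fields (not the actual
// data-plane-controller schema).
#[derive(Deserialize)]
struct DeploymentConfig {
    /// Tier in [0, 100]; defaults to 50 when unspecified.
    #[serde(default = "default_tier")]
    tier: u8,
    /// Optional tag describing the deployment's purpose; defaults to empty.
    #[serde(default)]
    tag: String,
    // ... other deployment fields elided.
}

fn default_tier() -> u8 {
    50
}

/// Fold the tag into the deployment hash only when it's non-empty,
/// so pre-existing untagged deployments keep their current hashes.
fn deployment_hash(config: &DeploymentConfig, base_inputs: &[u8]) -> String {
    let mut hasher = Sha256::new();
    hasher.update(base_inputs);
    if !config.tag.is_empty() {
        hasher.update(config.tag.as_bytes());
    }
    format!("{:x}", hasher.finalize())
}
```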

We will introduce a max_tier column to the data_plane_releases table

  • A matched table row applies to a deployment if and only if the row's max_tier is >= the deployment's tier (see the sketch below).
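
As a sketch of this matching rule, assuming a hypothetical Rust-side row shape for data_plane_releases:

```rust
/// Hypothetical row shape for data_plane_releases.
struct DataPlaneRelease {
    version: String,
    max_tier: u8,
    // ... other columns elided.
}

/// A release row applies to a deployment iff the row's max_tier
/// is >= the deployment's tier.
fn applicable_releases<'a>(
    releases: &'a [DataPlaneRelease],
    deployment_tier: u8,
) -> impl Iterator<Item = &'a DataPlaneRelease> + 'a {
    releases.iter().filter(move |r| r.max_tier >= deployment_tier)
}
```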

We will refactor data-plane-controller to be not only a DB automation, but also an HTTP server

  • We'll run both a Cloud Run Job (as we do today) and a Cloud Run Service.
  • The Job will continue to run the automations crate executor, but will dispatch an HTTP request to the Service rather than doing the work itself.
    • This makes jobs very lightweight, and we can run a single instance with effectively infinite job concurrency.
  • The Cloud Run Service will accept the request and do the actual work.
    • It'd be configured with generous RAM, a request concurrency of 1, and an effectively infinite timeout.
    • The Job client will manage the actual timeout, and cancel the request on reaching it (sketched below).
    • The Service would need to quickly wind down work, gracefully if possible
      • ... though anything better than a hard kill is a strict improvement compared to today
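
A rough sketch of the Job-side dispatch, assuming reqwest with its json feature enabled; the endpoint URL, payload shape, and timeout value are hypothetical:

```rust
use std::time::Duration;

/// Sketch of the Job-side dispatch. The Job owns the timeout: when it
/// elapses, the request future errors out and the in-flight connection
/// is dropped, which is the Service's cue to wind down its work.
async fn dispatch_task(
    client: &reqwest::Client,
    payload: &serde_json::Value,
) -> anyhow::Result<()> {
    let response = client
        .post("https://data-plane-controller.example.app/execute")
        .json(payload)
        // Per-request timeout, overriding any client-wide default.
        .timeout(Duration::from_secs(55 * 60))
        .send()
        .await?;
    response.error_for_status()?;
    Ok(())
}
```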

Discussion

This design solves a few problems for us:

  • It makes deployment scaling elastic
    • The Cloud Run Service will scale up and down as required to serve HTTP requests initiated by the Job.
  • It allows for single-instance canary deployments without manual steps
    • We'd continuously operate single-instance deployments in canary'd data-planes with a low tier.
    • Canaries would have a canary tag value, which influences their hash
      • ... which lets them live alongside the "main" high-tier deployment even if they're on the same version.
  • It lets us perform multi-stage rollouts where we deploy
    • First to private test planes
    • Then to production canaries
    • Then to an arbitrary-depth waterfall of tiered public and private deployments
    • Each stage of the rollout is achieved by ratcheting max_tier of a data_plane_releases row upwards, minimizing the chance for human error (a sketch of one ratchet step follows this list).
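
One ratchet step might look like the following sketch, assuming sqlx against Postgres; the table and column names follow the proposal, while the id column is hypothetical:

```rust
/// Sketch of one rollout step: ratchet a release row's max_tier upward,
/// never downward.
async fn ratchet_max_tier(
    pool: &sqlx::PgPool,
    release_id: i64,
    new_max_tier: i16,
) -> sqlx::Result<()> {
    sqlx::query(
        "UPDATE data_plane_releases
         SET max_tier = $2
         WHERE id = $1 AND max_tier < $2",
    )
    .bind(release_id)
    .bind(new_max_tier)
    .execute(pool)
    .await?;
    Ok(())
}
```

The max_tier < $2 guard makes each step monotonic and idempotent: re-running a stage is a no-op, and a stale or out-of-order update can never move a release back down the waterfall.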
