Problem
Today, data-plane-controller runs as a Cloud Run Job: it runs for an hour, gracefully exits when ready, and a crontab starts the next one. The Job is given a good amount of memory, as Pulumi / Ansible can use quite a bit, which limits our effective concurrency to 1. This is currently barely enough to keep up with periodic infra refreshes, and releases take a very long time unless we remember to manually spam the "Execute" button -- and even then, that only lasts for an hour. We'd like more elastic scaling, so that data-plane config changes are applied as quickly as possible.
However, there's a related risk: if data-plane config changes go out very quickly, an error in the configuration also has maximum splash damage. We need to minimize the number of manual steps involved in a release and the chance for human error, and we want an automated waterfall that catches problems early.
Proposal
We will introduce a "tier" concept (an integer) to data-plane deployment configurations.
- Tiers will range from zero to 100, with 50 being the default if not specified
We will introduce a "tag" concept (a string) to data-plane deployment configurations
- Tags are optional metadata describing the purpose of a deployment and default to empty.
- Tags are folded into the generated deployment hash
- ... Only if the tag is non-empty, to prevent churning deployment hashes
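A minimal Rust sketch of how these two fields might hang together; the field names and hashing hook are assumptions for illustration, not a final schema:

```rust
use std::hash::{Hash, Hasher};

/// Hypothetical slice of a data-plane deployment configuration,
/// showing only the two proposed fields.
struct DeploymentConfig {
    /// 0..=100; treated as 50 when unspecified.
    tier: Option<u8>,
    /// Optional metadata describing the deployment's purpose.
    tag: String,
}

impl DeploymentConfig {
    fn tier(&self) -> u8 {
        self.tier.unwrap_or(50)
    }

    /// Fold this config into the generated deployment hash.
    /// The tag participates only when non-empty, so existing
    /// tag-less deployments keep their current hashes.
    fn fold_into_hash<H: Hasher>(&self, hasher: &mut H) {
        // ... existing configuration fields are hashed as before ...
        if !self.tag.is_empty() {
            self.tag.hash(hasher);
        }
    }
}
```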
We will introduce a `max_tier` column to the `data_plane_releases` table.
- A matched table row will apply to a deployment if and only if the row's `max_tier` is >= the deployment's `tier` (sketched below).
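The matching rule itself is a single comparison. A sketch with an illustrative row shape (real rows would carry additional columns describing what to release):

```rust
/// Illustrative shape of a data_plane_releases row.
struct DataPlaneRelease {
    max_tier: i32,
    // ... other release-selection columns elided ...
}

/// A row that already matches a deployment's other criteria applies
/// if and only if its max_tier has been ratcheted up to, or past,
/// the deployment's tier.
fn applies(row: &DataPlaneRelease, deployment_tier: i32) -> bool {
    row.max_tier >= deployment_tier
}
```

For example, a canary deployment at tier 10 picks up a release as soon as the row's `max_tier` reaches 10, while default tier-50 deployments wait until it's ratcheted to 50 or beyond (tier values here are illustrative).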
We will refactor data-plane-controller to be not only a DB automation, but also an HTTP server.
- We'll run both a Cloud Run Job (as we do today) and a Cloud Run Service.
- The Job will continue to run the `automations` crate executor, but will dispatch an HTTP request to the Service rather than doing the work itself.
- This makes jobs very lightweight, and we can run a single instance with effectively infinite job concurrency.
- The Cloud Run Service will accept the request and do the actual work.
- It'd be configured with generous RAM, a request concurrency of 1, and an effectively infinite timeout.
- The Job client will manage the actual timeout, and cancel the request upon reaching it (sketched below).
- The Server would need to quickly wind down work, gracefully if possible
- ... though anything better than a hard kill is a strict improvement over today.
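A sketch of the Job-side dispatch, assuming `reqwest`/`tokio` and a hypothetical `/execute` endpoint on the Service (the real endpoint and request shape aren't specified here):

```rust
use std::time::Duration;

/// Dispatch one unit of work to the Cloud Run Service, owning the
/// deadline on the client side. On timeout the request future is
/// dropped, cancelling the in-flight HTTP request, and the Service
/// is expected to wind down the work as gracefully as it can.
async fn dispatch_task(
    client: &reqwest::Client,
    service_url: &str,        // e.g. the Cloud Run Service URL
    task: &serde_json::Value, // hypothetical task payload
) -> anyhow::Result<()> {
    let deadline = Duration::from_secs(55 * 60); // Job-managed, not the Service's

    match tokio::time::timeout(
        deadline,
        client.post(format!("{service_url}/execute")).json(task).send(),
    )
    .await
    {
        Ok(response) => {
            response?.error_for_status()?;
            Ok(())
        }
        Err(_elapsed) => anyhow::bail!("deadline elapsed; request was cancelled"),
    }
}
```

Because the Service handles one request per instance (request concurrency of 1), Cloud Run's autoscaler gives each in-flight task its own instance, which is what makes the scaling elastic.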
Discussion
This design solves a few problems for us:
- It makes deployment scaling elastic.
  - The Cloud Run Service will scale up and down as required to serve HTTP requests initiated by the Job.
- It allows for single-instance canary deployments without manual steps.
  - We'd continuously operate single-instance deployments in canary'd data-planes with a low tier.
  - Canaries would have a `canary` tag value, which influences their hash
  - ... which lets them live alongside the "main" high-tier deployment even if they're on the same version.
- It lets us perform multi-stage rollouts where we deploy:
  - First to private test planes
  - Then to production canaries
  - Then to an arbitrary-depth waterfall of tiered public and private deployments
  - Each stage of the rollout is achieved by ratcheting the `max_tier` of a `data_plane_releases` row upwards, minimizing the chance for human error (see the sketch below).
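For illustration, one stage of the waterfall could be a single `UPDATE`; this sketch assumes sqlx with Postgres and a hypothetical `id` key on `data_plane_releases` (the row's other columns aren't shown):

```rust
/// Raise a release row's max_tier to reach the next rollout stage.
/// GREATEST() keeps the ratchet one-directional, even if a stage is
/// re-run with a stale or lower value.
async fn ratchet_release(
    pool: &sqlx::PgPool,
    release_id: i64, // hypothetical primary key of the release row
    new_max_tier: i32,
) -> sqlx::Result<()> {
    sqlx::query(
        "UPDATE data_plane_releases
            SET max_tier = GREATEST(max_tier, $1)
          WHERE id = $2",
    )
    .bind(new_max_tier)
    .bind(release_id)
    .execute(pool)
    .await?;
    Ok(())
}
```

With illustrative tiers of 0 for private test planes, 10 for canaries, and the default 50 for most production deployments, ratcheting `max_tier` through 0 → 10 → 50 → 100 walks a release down the waterfall one stage at a time.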