Fix Cloud Run deployments after Dockerfile relocation #2558
base: master
Conversation
Dockerfiles were moved to the docker/ directory in commit fec63ec as part of the mise infrastructure refactoring, but the deployment workflows were never updated. The new Dockerfiles expect binaries in architecture-specific subdirectories (${TARGETARCH}/), while the workflows were still copying binaries directly into the crate directories.

Fixed workflows:
- deploy-agent-api.yaml: Create amd64/ subdir, copy agent and sops
- deploy-data-plane-controller.yaml: Create amd64/ subdir, copy binary, entrypoint, and sops
- deploy-oidc-discovery-server.yaml: Create amd64/ subdir, copy binary and entrypoint

Each workflow now:
1. Creates an amd64/ subdirectory in the source location
2. Copies the built binary into amd64/
3. Copies required scripts (entrypoint, sops) into amd64/
4. Copies the Dockerfile from docker/ to the source directory
5. Cloud Run deployment proceeds with the correct directory structure

Verified locally that Docker builds succeed with this structure.

This fixes Cloud Run deployments that have been broken since December 1st. Discovered when #2549 triggered the data-plane-controller deployment.
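For concreteness, here is a sketch of what that staging sequence could look like as a step in deploy-data-plane-controller.yaml. The binary output path, entrypoint filename, and sops location are assumptions for illustration, not the workflow's exact contents:

```yaml
# Hypothetical staging step; real paths in the workflow may differ.
- name: Stage binary, scripts, and Dockerfile for Cloud Run
  run: |
    # The relocated Dockerfile copies from ${TARGETARCH}/, so place artifacts under amd64/.
    mkdir -p crates/data-plane-controller/amd64
    cp target/release/data-plane-controller crates/data-plane-controller/amd64/
    cp crates/data-plane-controller/entrypoint.sh crates/data-plane-controller/amd64/
    cp "$(which sops)" crates/data-plane-controller/amd64/
    # Cloud Run builds from the source directory, so the Dockerfile must sit alongside it.
    cp docker/data-plane-controller.Dockerfile crates/data-plane-controller/Dockerfile
```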
Force-pushed from 2d1cef0 to 12d05c2
jgraettinger
left a comment
LGTM, though this ought to be cleaned up further.
A better structure would be to let the Platform Build CI task complete and then deploy the pre-built/pushed images directly, rather than compiling and building an image from scratch.
Force-pushed from baa697e to 12d05c2
Refactors deployment workflows to use pre-built Docker images from Platform Build CI
rather than rebuilding from source on every deployment, as suggested by Johnny.
Benefits:
- Faster deployments (no build step)
- Deploy exactly what was tested in CI
- More efficient resource usage
- Simpler workflows (39 net lines removed)
Changes:
- data-plane-controller: Triggers on Platform Build completion, deploys pre-built image
- oidc-discovery-server: Triggers on Platform Build completion, deploys pre-built image
- agent-api: Manual trigger only, deploys pre-built dev-next image
All workflows now (the image-tag step is sketched after this list):
1. Determine the image tag (from Platform Build or dev-next for manual triggers)
2. Deploy the pre-built image from ghcr.io/estuary/{service}:{tag}
3. Skip all build/package/Docker build steps
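For illustration, a minimal sketch of the tag-determination step; the step id and the image output name mirror the workflow inputs quoted later in this review, while the step name and the service used in the reference are just examples:

```yaml
# Hypothetical sketch; the real step may differ in detail.
- name: Determine image tag
  id: image-tag
  run: |
    # Platform Build tags pushed images with `git describe`, so derive the same tag here.
    TAG="$(git describe --tags)"
    echo "image=ghcr.io/estuary/data-plane-controller:${TAG}" >> "$GITHUB_OUTPUT"
```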
Images are built by the Platform Build workflow via mise/tasks/ci/docker-images, which:
- Builds all docker/*.Dockerfile images as multi-arch (amd64/arm64)
- Pushes to ghcr.io/estuary/ with git describe tags
- Runs automatically on every master branch push
Cloud Run cannot pull directly from external registries like ghcr.io.
Instead, it requires using Artifact Registry remote repositories which
act as a proxy to the external registry.
Updated all deployment workflows to use the Artifact Registry proxy path:
us-central1-docker.pkg.dev/estuary-control/ghcr/estuary/{service}:{tag}
The remote repository was created with:
gcloud artifacts repositories create ghcr \
--project=estuary-control \
--repository-format=docker \
--location=us-central1 \
--mode=remote-repository \
--remote-docker-repo=https://ghcr.io
Platform Build continues pushing to ghcr.io, and Artifact Registry
automatically proxies pulls from Cloud Run to the upstream registry.
Tagging @jgraettinger for re-review. I went back and forth a few times on this because I forgot, and then remembered, why these are built entirely in Cloud Build. The original deployment workflows built from source using Cloud Build, which automatically pushed images to Artifact Registry (…). Per the Google Cloud docs, Cloud Run cannot pull images directly from an external registry like ghcr.io; it requires them to be in Artifact Registry. We have been building over there to get around this limitation for a while.

The Fix

I've set up an Artifact Registry remote repository that acts as a proxy to ghcr.io:

gcloud artifacts repositories create ghcr \
--project=estuary-control \
--repository-format=docker \
--location=us-central1 \
--mode=remote-repository \
--remote-docker-repo=https://ghcr.io

I've pulled the latest data-plane-controller (…).
jgraettinger
left a comment
LGTM % questions
project_id: estuary-control
region: us-central1
source: crates/agent/
image: ${{ steps.image-tag.outputs.image }}
sanity check: does it pin the sha256 of the current dev-next, or will it truly pull the "latest" dev-next every time Cloud Run decides to pull the image?
According to https://docs.cloud.google.com/run/docs/deploying#service
Deploying to a service for the first time creates its first revision. Note that revisions are immutable. If you deploy from a container image tag, it will be resolved to a digest and the revision will always serve this particular digest.
Without fetch-depth: 0, actions/checkout does a shallow clone that doesn't include tags. This causes git describe --tags to fall back to just the commit hash instead of the proper version tag. Added to deploy-data-plane-controller and deploy-oidc-discovery-server workflows which use git describe to determine the image tag. The deploy-agent-api workflow doesn't need this since it hardcodes TAG="dev-next".
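For reference, the corresponding checkout configuration looks like this (the action version pin is illustrative):

```yaml
- name: Checkout
  uses: actions/checkout@v4
  with:
    # Fetch full history and tags so `git describe --tags` resolves a real version tag.
    fetch-depth: 0
```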
Summary
Fixes three broken Cloud Run deployment workflows that were never updated after the December 1st Dockerfile relocation (commit fec63ec).
Problem
The mise infrastructure refactoring moved Dockerfiles from crate directories to docker/:
- crates/agent/Dockerfile → docker/control-plane-agent.Dockerfile
- crates/data-plane-controller/Dockerfile → docker/data-plane-controller.Dockerfile
- crates/oidc-discovery-server/Dockerfile → docker/oidc-discovery-server.Dockerfile

The new Dockerfiles use COPY ${TARGETARCH}/binary to copy binaries from architecture-specific subdirectories, but the workflows were still copying binaries directly to crate directories without creating the amd64/ subdirectory structure. This caused all three deployments to fail when triggered.
Solution
Updated all three workflows to:
- Create the amd64/ subdirectory in the deployment source directory
- Copy the built binary into amd64/
- Copy required scripts (entrypoint, sops) into amd64/
- Copy the Dockerfile from docker/ to the source directory

Testing
Locally verified that Docker builds succeed with this directory structure by simulating the workflow steps and running docker build.

Fixed Workflows
- deploy-agent-api.yaml
- deploy-data-plane-controller.yaml (immediately broken by Add geo_region field to AnsibleHost struct #2549)
- deploy-oidc-discovery-server.yaml

Related