
Conversation

@skord skord (Member) commented Dec 15, 2025

Summary

Fixes three broken Cloud Run deployment workflows that were never updated after the December 1st Dockerfile relocation (commit fec63ec).

Problem

The mise infrastructure refactoring moved Dockerfiles from crate directories to docker/:

  • crates/agent/Dockerfile → docker/control-plane-agent.Dockerfile
  • crates/data-plane-controller/Dockerfile → docker/data-plane-controller.Dockerfile
  • crates/oidc-discovery-server/Dockerfile → docker/oidc-discovery-server.Dockerfile

The new Dockerfiles use COPY ${TARGETARCH}/binary to copy binaries from architecture-specific subdirectories, but the workflows were still copying binaries directly to crate directories without creating the amd64/ subdirectory structure.

This caused all three deployments to fail when triggered.

Solution

Updated all three workflows to:

  1. Create amd64/ subdirectory in the deployment source directory
  2. Copy built binaries into amd64/
  3. Copy required files (entrypoint scripts, sops) into amd64/
  4. Copy the Dockerfile from docker/ to the source directory
  5. Proceed with Cloud Run deployment

Testing

Locally verified that Docker builds succeed with this directory structure by simulating the workflow steps and running docker build.
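A minimal sketch of such a simulation, using data-plane-controller as the example (binary, entrypoint, and sops paths are assumptions, not the literal workflow steps):

  # Simulate the fixed workflow steps locally for data-plane-controller:
  SRC=crates/data-plane-controller
  mkdir -p "${SRC}/amd64"

  # Steps 1-3: built binary and required files go into amd64/, matching
  # the COPY ${TARGETARCH}/... instruction in the relocated Dockerfile.
  cp target/release/data-plane-controller "${SRC}/amd64/"   # illustrative build path
  cp "${SRC}/entrypoint.sh" "${SRC}/amd64/"                 # where the service has one
  cp "$(which sops)" "${SRC}/amd64/"                        # where sops is required

  # Step 4: the Dockerfile now lives under docker/ (destination name assumed).
  cp docker/data-plane-controller.Dockerfile "${SRC}/Dockerfile"

  # Step 5 (locally): a build over this context should now succeed.
  docker build -t data-plane-controller:local-test "${SRC}"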

Fixed Workflows

  • deploy-agent-api.yaml
  • deploy-data-plane-controller.yaml
  • deploy-oidc-discovery-server.yaml

Related

@skord skord requested a review from a team December 15, 2025 15:17
@skord skord self-assigned this Dec 15, 2025
Dockerfiles were moved to docker/ directory in commit fec63ec as part
of the mise infrastructure refactoring, but deployment workflows were
never updated.

The new Dockerfiles expect binaries in architecture-specific subdirectories
(${TARGETARCH}/), but workflows were copying binaries directly to crate
directories.

Fixed workflows:
- deploy-agent-api.yaml: Create amd64/ subdir, copy agent and sops
- deploy-data-plane-controller.yaml: Create amd64/ subdir, copy binary, entrypoint, and sops
- deploy-oidc-discovery-server.yaml: Create amd64/ subdir, copy binary and entrypoint

Each workflow now:
1. Creates amd64/ subdirectory in the source location
2. Copies the built binary into amd64/
3. Copies required scripts (entrypoint, sops) into amd64/
4. Copies the Dockerfile from docker/ to the source directory
5. Cloud Run deployment proceeds with correct directory structure

Verified locally that Docker builds succeed with this structure.

This fixes Cloud Run deployments that have been broken since December 1st.
Discovered when #2549 triggered the data-plane-controller deployment.
@skord skord force-pushed the fix-data-plane-controller-deployment branch from 2d1cef0 to 12d05c2 on December 15, 2025 16:22
@jgraettinger jgraettinger (Member) left a comment

LGTM, though this ought to be cleaned up further.

A better structure would be to let the Platform Build CI task complete and then deploy the pre-built/pushed images directly, rather than compiling and building an image from scratch.

@skord skord force-pushed the fix-data-plane-controller-deployment branch from baa697e to 12d05c2 on December 16, 2025 16:38
@github-actions

PR Preview Action v1.6.3

🚀 View preview at
https://estuary.github.io/flow/pr-preview/pr-2558/

Built to branch gh-pages at 2025-12-16 16:39 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

…scratch

Refactors deployment workflows to use pre-built Docker images from Platform Build CI
rather than rebuilding from source on every deployment, as suggested by Johnny.

Benefits:
- Faster deployments (no build step)
- Deploy exactly what was tested in CI
- More efficient resource usage
- Simpler workflows (39 net lines removed)

Changes:
- data-plane-controller: Triggers on Platform Build completion, deploys pre-built image
- oidc-discovery-server: Triggers on Platform Build completion, deploys pre-built image
- agent-api: Manual trigger only, deploys pre-built dev-next image

All workflows now:
1. Determine the image tag (from Platform Build or dev-next for manual triggers)
2. Deploy the pre-built image from ghcr.io/estuary/{service}:{tag}
3. Skip all build/package/Docker build steps

Images are built by the Platform Build workflow via mise/tasks/ci/docker-images, which:
- Builds all docker/*.Dockerfile images as multi-arch (amd64/arm64)
- Pushes to ghcr.io/estuary/ with git describe tags
- Runs automatically on every master branch push
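(Illustrative sketch only: a rough shell equivalent of the tag/image resolution described above. Variable names are assumptions, and the next commit replaces the ghcr.io host with an Artifact Registry proxy path.)

  # Determine the tag Platform Build pushed the image under
  # (the manual agent-api deploy instead pins TAG="dev-next"):
  TAG="$(git describe --tags --always)"    # e.g. v0.6.0-83-g5d38214e50
  IMAGE="ghcr.io/estuary/data-plane-controller:${TAG}"
  echo "Deploying pre-built image ${IMAGE}"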
Cloud Run cannot pull directly from external registries like ghcr.io.
Instead, it requires using Artifact Registry remote repositories which
act as a proxy to the external registry.

Updated all deployment workflows to use the Artifact Registry proxy path:
  us-central1-docker.pkg.dev/estuary-control/ghcr/estuary/{service}:{tag}

The remote repository was created with:
  gcloud artifacts repositories create ghcr \
    --project=estuary-control \
    --repository-format=docker \
    --location=us-central1 \
    --mode=remote-repository \
    --remote-docker-repo=https://ghcr.io

Platform Build continues pushing to ghcr.io, and Artifact Registry
automatically proxies pulls from Cloud Run to the upstream registry.
@skord skord (Member, Author) commented Dec 16, 2025

Tagging @jgraettinger for re-review. I went back and forth on this a few times because I forgot, then remembered, why these were being built entirely in Cloud Build.

The original deployment workflows built from source using Cloud Build, which automatically pushed images to Artifact Registry (us-central1-docker.pkg.dev/estuary-control/cloud-run-source-deploy/...). When I refactored to use pre-built images from Platform Build, I tried pointing directly to ghcr.io/estuary/* - but that doesn't work.

From Google Cloud docs:

"To deploy public or private container images that are not stored in Artifact Registry or Docker Hub, set up an Artifact Registry remote repository."

We have been building in Cloud Build (which pushes to Artifact Registry) to get around this limitation for a while.

The Fix

I've set up an Artifact Registry remote repository that acts as a proxy to ghcr.io:

gcloud artifacts repositories create ghcr \
  --project=estuary-control \
  --repository-format=docker \
  --location=us-central1 \
  --mode=remote-repository \
  --remote-docker-repo=https://ghcr.io

I've pulled the latest data-plane-controller (docker pull us-central1-docker.pkg.dev/estuary-control/ghcr/estuary/data-plane-controller:v0.6.0-83-g5d38214e50) on my laptop to ensure that this works.

@jgraettinger jgraettinger (Member) left a comment

LGTM % questions

project_id: estuary-control
region: us-central1
source: crates/agent/
image: ${{ steps.image-tag.outputs.image }}
Member commented:

sanity check: does it pin the sha256 of the current dev-next, or will it truly pull the "latest" dev-next every time Cloud Run decides to pull the image?

Member (Author) replied:

According to https://docs.cloud.google.com/run/docs/deploying#service

Deploying to a service for the first time creates its first revision. Note that revisions are immutable. If you deploy from a container image tag, it will be resolved to a digest and the revision will always serve this particular digest.
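(As a hedged sketch of how to double-check this after a deploy, using standard gcloud commands with illustrative service and revision names, one can inspect the revision the deploy created; per the docs quoted above, its image is resolved to a digest at deploy time.)

  # List revisions for the service, then describe the newest one and check
  # the image reference it records (it should be pinned, not floating dev-next):
  gcloud run revisions list --service agent-api --region us-central1 --project estuary-control
  gcloud run revisions describe agent-api-00042-abc --region us-central1 --project estuary-control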

Without fetch-depth: 0, actions/checkout does a shallow clone that
doesn't include tags. This causes git describe --tags to fall back
to just the commit hash instead of the proper version tag.

Added to deploy-data-plane-controller and deploy-oidc-discovery-server
workflows which use git describe to determine the image tag.

The deploy-agent-api workflow doesn't need this since it hardcodes
TAG="dev-next".
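(A hedged sketch of the difference, assuming the image tag is derived with something like git describe --tags --always:)

  # Shallow clone (the actions/checkout default) fetches no tags, so
  # describe can only fall back to an abbreviated commit hash:
  git describe --tags --always    # -> 5d38214e50

  # With full history and tags (fetch-depth: 0), it yields the version-style
  # tag that Platform Build tagged the images with:
  git fetch --tags --unshallow    # manual equivalent of fetch-depth: 0
  git describe --tags --always    # -> v0.6.0-83-g5d38214e50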