@golgeek golgeek commented Oct 28, 2025

Prior to this patch, roachprod clusters were created from bare Ubuntu images.

This was inadequate for several reasons, including:

  • dependency on the availability of third parties (GCS, APT repositories)
  • two clusters spun up at different times could end up with different systems (package versions, ...), creating reproducibility issues
  • the growing number of installed dependencies increases boot time

To address this, this patch adds a new roachprod bake-images command that relies on HashiCorp Packer to pre-bake ready-to-use cloud images for AWS and GCP. This creates a system dependency on Packer: the machine that runs the command must have Packer installed and must be authenticated on AWS and GCP with authorization to create instances and publish new images. If an image already exists, it is not built again, which makes re-running roachprod bake-images safe.
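The "skip if the image already exists" behavior can be sketched as follows. This is an illustration only, not the code in this patch: `bake_missing`, the `existing` set, and the `build` callback (a stand-in for invoking Packer) are hypothetical names.

```python
from typing import Callable, Iterable, List, Set


def bake_missing(
    wanted: Iterable[str],
    existing: Set[str],
    build: Callable[[str], None],
) -> List[str]:
    """Build only the images that are not already published.

    `existing` stands in for the set of image names found in the cloud
    provider; `build` stands in for a Packer invocation (both hypothetical).
    """
    built = []
    for name in wanted:
        if name in existing:
            # Already published: skip it, so a second run is a no-op.
            continue
        build(name)
        existing.add(name)
        built.append(name)
    return built
```

Calling `bake_missing` twice with the same inputs builds each missing image once and nothing on the second call, which is the property that makes re-running the command safe.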

The pre-baking process creates images for amd64, arm64 and fips, and pushes them to the roachprod-compatible regions (only for AWS, since images are globally available in GCP). The images are tagged with a checksum of the startup script, which defines their unique version.
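A minimal sketch of how a version tag could be derived from the startup script, assuming SHA-256 over the script bytes and a truncated hex digest. The function name, the `roachprod-` prefix, and the name layout are hypothetical; the naming scheme actually used by the patch is not shown in this description.

```python
import hashlib


def image_name(startup_script: str, arch: str, digest_len: int = 12) -> str:
    """Return a hypothetical image name keyed on the startup script content."""
    # Hash the exact script bytes: any change to the script yields a new tag.
    digest = hashlib.sha256(startup_script.encode("utf-8")).hexdigest()
    # The "roachprod-<arch>-<digest>" layout is an assumption for illustration.
    return f"roachprod-{arch}-{digest[:digest_len]}"
```

Because the name depends only on the script content and the flavor, the bake step and the runtime lookup can compute it independently and agree without sharing any state.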

At runtime, each provider checksums the startup script to determine which pre-baked image should be used, and checks whether that image is available in the cloud provider for the specific region/zone:

  • if the image exists, it is used to create the instance, and only the runtime subset of the startup script is executed on the instance, reducing startup time to a minimum (5s or so for disk setup)
  • if the image does not exist, the system falls back to the base image, and the whole startup script (pre-baking + runtime) is executed on the instance
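The two cases above can be sketched as a single lookup. This is a simplified illustration, not the provider code: `select_image`, the mode strings, and the `available_in_region` set (standing in for a query to the cloud provider) are assumptions.

```python
from typing import Set, Tuple


def select_image(
    prebaked_name: str,
    base_image: str,
    available_in_region: Set[str],
) -> Tuple[str, str]:
    """Pick the image to boot from and the startup-script mode to run.

    Returns (image, mode), where mode is "runtime" when a pre-baked image
    is available and "pre-baking+runtime" when falling back to the base image.
    """
    if prebaked_name in available_in_region:
        # Dependencies are already installed in the image: run only runtime steps.
        return prebaked_name, "runtime"
    # No pre-baked image in this region: boot the base image and run everything.
    return base_image, "pre-baking+runtime"
```

The fallback keeps cluster creation working in regions or on branches where no image has been baked yet, at the cost of the longer boot.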

This patch also drops the AMI IDs (or image names in GCP) hardcoded in JSON and introduces auto-discovery of the most recent version of the base image, based on the image name/family and the owner or project ID. This lets us automatically keep up to date with the latest patch releases, which are usually security updates.
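As an illustration of the auto-discovery idea, the sketch below picks the newest image among records that match an owner and a name prefix. It works on plain dictionaries rather than calling a cloud API; the record fields (`owner`, `name`, `created`) and the function name are hypothetical.

```python
from typing import Dict, List, Optional


def latest_base_image(
    images: List[Dict[str, str]],
    owner: str,
    name_prefix: str,
) -> Optional[Dict[str, str]]:
    """Return the most recently created image matching owner and name prefix."""
    candidates = [
        img
        for img in images
        if img["owner"] == owner and img["name"].startswith(name_prefix)
    ]
    if not candidates:
        return None
    # Assumes "created" is an ISO 8601 UTC timestamp in a uniform format,
    # so lexicographic order matches chronological order.
    return max(candidates, key=lambda img: img["created"])
```

Filtering on the owner (or project ID) as well as the name matters: matching on the name alone could select an image with a look-alike name published by an unrelated account.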

Notes:

  • this patch only implements AWS and GCP; Azure and IBM should also be implemented
  • a CI mechanism should be built to automatically build all images whenever the startup scripts change (either GitHub upon merge to master, or TeamCity nightly runs)
  • there is currently no built-in way to deprecate/clean up previous images, since they might still be used on older branches; a cleanup routine should be considered if/when the number of images gets out of hand

Beyond this first iteration, a concept of "pre-bake only snippets" should come next: snippets that are executed only at pre-baking time, and not at runtime even if there is no pre-baked image.
These snippets would contain ad hoc roachtest setups (building/pre-installing third-party tools like Prometheus/Grafana, Jepsen, Kafka CLI, ...), which would remove the need for these tests to build/install third-party dependencies at runtime when the test runs on an instance backed by a pre-baked image (see #62066 as an example).
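One possible way to model this proposal is to tag each snippet with a phase and filter by execution mode. This is a sketch of the idea only; the `Snippet` type, the phase names, and the mode names are all hypothetical and not part of this patch.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Snippet:
    name: str
    phase: str  # one of "prebake", "runtime", "prebake_only" (hypothetical labels)


def snippets_to_run(snippets: List[Snippet], mode: str) -> List[str]:
    """Return the names of the snippets to execute for a given mode.

    Modes (hypothetical): "bake" when building an image, "boot_prebaked" when
    booting from a pre-baked image, "boot_fallback" when booting the base image.
    """
    phases = {
        "bake": {"prebake", "prebake_only"},
        "boot_prebaked": {"runtime"},
        # Fallback runs the regular pre-baking steps, but never the
        # "prebake_only" ones.
        "boot_fallback": {"prebake", "runtime"},
    }[mode]
    return [s.name for s in snippets if s.phase in phases]
```

Under this model, a "prebake_only" snippet such as a Jepsen install never runs at boot, so a test that needs it would still have to handle the fallback case itself.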

Epic: none
Informs: #150144
Release note: None

blathers-crl bot commented Oct 28, 2025

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

golgeek commented Oct 28, 2025

It takes 25-30 minutes to build all flavors (amd64, arm64 and fips) for AWS and GCE, but most of this time is spent copying AWS images from the build region to other regions.

The current system ensures that an image is available in each region configured in config.json. Copying only to the regions used in roachtests (us-east-2, us-west-2, eu-west-2) would significantly reduce the build time, but clusters in any other region would then have to run the whole startup script (pre-baking + runtime).
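The trade-off can be made concrete with a small sketch that lists which configured regions would fall back to the full startup script if images were copied only to a subset. The function name is hypothetical, and the extra region in the usage note is an illustrative value, not taken from config.json.

```python
from typing import Iterable, List


def fallback_regions(configured: Iterable[str], copied: Iterable[str]) -> List[str]:
    """Return the configured regions that would have no pre-baked image."""
    copied_set = set(copied)
    # Regions without a copied image run the whole startup script at boot.
    return sorted(r for r in configured if r not in copied_set)
```

For example, with configured regions `["us-east-2", "us-west-2", "eu-west-2", "ap-south-1"]` and copies restricted to the three roachtest regions, only `ap-south-1` would pay the full startup cost.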

@golgeek golgeek force-pushed the ludo/packer branch 3 times, most recently from 7a57c3d to d6edad4 Compare October 29, 2025 15:38
@golgeek golgeek marked this pull request as ready for review October 29, 2025 17:40
@golgeek golgeek requested a review from a team as a code owner October 29, 2025 17:40
@golgeek golgeek requested review from DarrylWong and shailendra-patel and removed request for a team October 29, 2025 17:40
@golgeek golgeek force-pushed the ludo/packer branch 2 times, most recently from 3bd92d9 to a09006f Compare October 30, 2025 19:01