
Fix ~75% flake rate in upgrade-mysql and upgrade-postgres pipelines #2705

Merged

beyhan merged 1 commit into main from upgrade-spec-improvements on Apr 2, 2026

Conversation

@aramprice
Member

Both upgrade pipelines have been failing approximately 75% of the time since at least 2026-03-11. Analysis of 41 failed builds across both pipelines identified three failure categories, all stemming from BOSH agent unresponsiveness after a director upgrade.

Failure Analysis

Examined builds since 2026-03-11:

  • upgrade-mysql: 19 failures / 25 builds (76% failure rate)
  • upgrade-postgres: 22 failures / 30 builds (73% failure rate)

Category 1: Agent timeout during recreate-zookeeper (34/41 failures)

After the director is upgraded (`create-env`), agents deployed by the old director become unresponsive to the new director. The pipeline had a blind `sleep 300` before attempting to redeploy zookeeper, but agents were still not reconnected when the deploy started.

  • MySQL builds: `run_script` (pre-stop) times out after 45s during the update of the 3rd-5th zookeeper instance. 2-4 instances update successfully before one agent fails to respond.
  • Postgres builds: `get_state` times out during `Preparing deployment` before any instance updates begin. This suggests agents take longer to reconnect when using the internal Postgres DB.

Category 2: GCP VM type incompatibility (7/41 failures)

The `pd-ssd` disk type is incompatible with the `c4a-standard-1` machine type. This only affected builds 250-253 (MySQL) and 458-460 (Postgres) in mid-March, and has already been resolved externally; no action needed.

Category 3: Terraform destroy cascade failure (2/41 failures)

When agent timeouts cause the deploy to fail, the `ensure` teardown's `delete-deployment` also times out on the same unresponsive agents. VMs remain attached to the subnetwork, causing Terraform destroy to fail with `resourceInUseByAnotherResource`.
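The eventual fix is to force-delete during teardown so cleanup cannot hang on dead agents. A minimal sketch of such a teardown loop, with `list_deployments_json` and `bosh_cli` as hypothetical stand-ins for `bosh-cli deployments --column name --json` and the real CLI so the control flow can be shown in isolation:

```shell
# Hypothetical teardown sketch; list_deployments_json and bosh_cli stub
# out the real bosh-cli so only the loop's shape is demonstrated.
list_deployments_json() {
  echo '{"Tables":[{"Rows":[{"name":"zookeeper"}]}]}'
}

bosh_cli() {
  # Stand-in for the real bosh-cli binary; just echoes the invocation.
  echo "bosh-cli $*"
}

teardown_deployments() {
  # `// []` keeps jq from erroring when no deployments exist; --force
  # skips drain/pre-stop hooks that hang on unresponsive agents.
  list_deployments_json \
    | jq -r '.Tables[0].Rows // [] | .[].name' \
    | while read -r name; do
        bosh_cli -n -d "$name" delete-deployment --force
      done
}
```

Quoting `"$name"` keeps each deployment name a single argument, and the `// []` fallback makes the loop a no-op when nothing is deployed.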

Changes

  1. Replace blind sleep with active agent health check (ci/tasks/wait-for-agents.{sh,yml}, ci/pipeline.yml)

    New `wait-for-agents` task polls `bosh vms` every 10s for up to 600s, waiting until all zookeeper agents report `process_state: running`. Replaces the blind `sleep 300` in both upgrade-postgres and upgrade-mysql pipeline jobs. Fails fast with diagnostic output if agents never reconnect.

  2. Add retry logic to zookeeper deploy (ci/tasks/deploy-zookeeper.sh)

    Wrap `bosh deploy --recreate` in a retry loop (3 attempts, 60s delay between retries). On failure, logs current VM state for diagnostics. Controlled by `MAX_DEPLOY_ATTEMPTS` and `DEPLOY_RETRY_DELAY` env vars. Also adds `DEPLOY_EXTRA_ARGS` to allow passing `--skip-drain` from the pipeline if needed.

  3. Use --force for teardown delete-deployment (ci/bats/tasks/destroy-director.sh)

    Add `--force` flag to `delete-deployment` in the teardown script. This skips drain and pre-stop lifecycle hooks during cleanup, preventing the cascade failure where teardown times out on the same unresponsive agents that caused the original deploy failure.

  4. Reduce zookeeper instances from 5 to 3 (ci/tasks/deploy-zookeeper/zookeeper-manifest.yml)

    3 instances is the minimum ZooKeeper quorum and is sufficient to validate the upgrade path. Fewer instances means lower probability of hitting an unresponsive agent, and faster update cycles. Canaries reduced from 2 to 1 accordingly.
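Changes 1 and 2 above amount to a poll-then-proceed shape. The sketch below is illustrative only: `fetch_vms_json` is a hypothetical stub for the real `bosh-cli -d zookeeper vms --json` call, and the jq filters assume the usual `Tables/Rows` JSON layout of the BOSH CLI:

```shell
# Sketch of the wait-for-agents polling logic. fetch_vms_json stubs the
# real `bosh-cli -d zookeeper vms --json` call so the loop is self-contained.
fetch_vms_json() {
  cat <<'JSON'
{"Tables":[{"Rows":[
  {"instance":"zookeeper/ab12","process_state":"running"},
  {"instance":"zookeeper/cd34","process_state":"running"},
  {"instance":"zookeeper/ef56","process_state":"running"}
]}]}
JSON
}

SLEEP_INTERVAL=10
MAX_ATTEMPTS=60   # 60 x 10s = 600s budget, matching the task description

wait_for_agents() {
  local i json total running
  for i in $(seq 1 "$MAX_ATTEMPTS"); do
    json=$(fetch_vms_json)
    # Count only zookeeper rows, and of those, only running ones.
    total=$(printf '%s' "$json" | jq '[.Tables[0].Rows[]
      | select(.instance | startswith("zookeeper/"))] | length')
    running=$(printf '%s' "$json" | jq '[.Tables[0].Rows[]
      | select(.instance | startswith("zookeeper/"))
      | select(.process_state == "running")] | length')
    if [ "$total" -gt 0 ] && [ "$running" -eq "$total" ]; then
      echo "all $total zookeeper agents running"
      return 0
    fi
    sleep "$SLEEP_INTERVAL"
  done
  echo "agents not ready after $((MAX_ATTEMPTS * SLEEP_INTERVAL))s" >&2
  return 1
}
```

On success the pipeline proceeds to the (retrying) deploy; on timeout the real task additionally dumps the last `bosh vms` output for diagnostics before exiting non-zero.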

Further Investigation Needed

The root cause — why agents become unresponsive after a director upgrade — remains uninvestigated. Likely candidates:

  • NATS server restarts during create-env, and agents exhaust their reconnect attempts (max_reconnect_attempts=4, reconnect_time_wait=2s in nats_rpc.rb) before NATS comes back up
  • NATS CA certificate rotation during upgrade invalidates existing agent TLS certificates (chicken-and-egg: can't push new certs to agents that can't connect)
  • The health monitor's scan-and-fix process runs during the sleep window and may interfere with agent state

Made-with: Cursor

@aramprice aramprice requested review from a team, a-hassanin, mariash, mkocher and selzoc and removed request for a team, a-hassanin, mariash and mkocher April 1, 2026 14:03

coderabbitai bot commented Apr 1, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.


Walkthrough

Replaces fixed sleeps with a CPI-aware wait-for-agents task and script, adds retryable Zookeeper deploys with configurable attempts/extra args, reduces Zookeeper canaries and instance count, and forces BOSH deployment deletions with --force.

Changes

| Cohort | File(s) | Summary |
| --- | --- | --- |
| BOSH Deletion Change | ci/bats/tasks/destroy-director.sh | Adds `--force` to `bosh-cli delete-deployment` when iterating director deployments. |
| Pipeline: Sleep → Agent Wait | ci/pipeline.yml | Replaces inline sleep-300-seconds steps with invocations of the wait-for-agents task for upgrade-postgres and upgrade-mysql. |
| Agent Wait Task | ci/tasks/wait-for-agents.sh, ci/tasks/wait-for-agents.yml | Adds a new CI task and script that configures the BOSH CLI from director-state, exports env/creds, polls `bosh-cli -d zookeeper vms --json` (up to 60×10s), and succeeds when all zookeeper instances are present with `process_state == "running"`, otherwise fails on timeout. |
| Zookeeper Deploy Task & Script | ci/tasks/deploy-zookeeper.yml, ci/tasks/deploy-zookeeper.sh | Adds params `DEPLOY_EXTRA_ARGS`, `MAX_DEPLOY_ATTEMPTS`, `DEPLOY_RETRY_DELAY`. Script retries `bosh-cli -d zookeeper deploy` up to the configured attempts (default 3) with a delay (default 60s), preserves unquoted `${DEPLOY_EXTRA_ARGS:-}`, logs VM state on failure, and exits non-zero after the final failure. |
| Zookeeper Manifest Config | ci/tasks/deploy-zookeeper/zookeeper-manifest.yml | Adjusts deployment sizing/update strategy: instance_groups[0].instances 5 → 3 and update.canaries 2 → 1; other update/watch settings unchanged. |

Sequence Diagram(s)

sequenceDiagram
  participant Pipeline as CI Pipeline
  participant WaitTask as wait-for-agents.sh
  participant BOSH_CLI as bosh-cli
  participant Director as BOSH Director
  participant VMs as Zookeeper VMs

  Pipeline->>WaitTask: run wait-for-agents task (inputs: director-state,...)
  WaitTask->>WaitTask: derive BOSH_ENVIRONMENT, export creds/env
  WaitTask->>BOSH_CLI: bosh-cli -d zookeeper vms --json
  BOSH_CLI->>Director: query VM state
  Director-->>BOSH_CLI: VM list (JSON)
  BOSH_CLI-->>WaitTask: JSON output
  WaitTask->>WaitTask: count total vs running via jq
  alt all present & running
    WaitTask-->>Pipeline: exit 0 (agents ready)
  else timeout or failure
    WaitTask->>BOSH_CLI: final bosh-cli -d zookeeper vms (best-effort)
    WaitTask-->>Pipeline: exit 1 (agents not ready)
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I nudged the pipeline with a twitch and a hop,
Waiting on agents till the last processes stop,
I retried the deploys with patience and flair,
Forced the old ghosts out with a confident snare,
Now three burrows hum where five used to flop.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Title check ✅ Passed: the title clearly and concisely summarizes the main change: fixing the ~75% flake rate in two CI pipelines through targeted improvements.
  • Description check ✅ Passed: the PR description is comprehensive and well-structured, covering failure analysis, root causes, implemented changes, and caveats. It exceeds template requirements with detailed context and reasoning.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ci/tasks/wait-for-agents.sh`:
- Around line 31-37: The vms_json variable is being assigned with stderr merged
into it (bosh-cli ... --json 2>&1), which can corrupt JSON for the subsequent jq
calls (used on vms_json) and cause jq parse failures; remove the `2>&1` so
stderr remains separate, keep the exit code check using exit_code and set -e
as-is, and when passing vms_json into jq (e.g., lines using `echo "$vms_json" |
jq -r ...`) replace the echo with `printf '%s' "$vms_json" | jq -r ...` to
safely handle special characters and line endings while preserving correct JSON
input for the jq expressions that compute total and running.
- Around line 36-42: The readiness check currently counts all VM rows from
vms_json (variables total and running) which includes non-zookeeper errands;
update the jq selectors used for total and running to first filter Rows by the
instance name starting with "zookeeper/" (e.g., select(.instance |
startswith("zookeeper/"))) so both total and running only consider zookeeper
instances when comparing counts; modify the two expressions that compute total
and running (they reference vms_json and the jq pipeline) to apply this filter
before counting.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0057b2ee-e938-4fff-ab15-f81ddac204c2

📥 Commits

Reviewing files that changed from the base of the PR and between 3f4620e and f6d7148.

📒 Files selected for processing (6)
  • ci/bats/tasks/destroy-director.sh
  • ci/pipeline.yml
  • ci/tasks/deploy-zookeeper.sh
  • ci/tasks/deploy-zookeeper/zookeeper-manifest.yml
  • ci/tasks/wait-for-agents.sh
  • ci/tasks/wait-for-agents.yml

@aramprice aramprice force-pushed the upgrade-spec-improvements branch from f6d7148 to 049e6cf Compare April 1, 2026 14:26

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ci/tasks/wait-for-agents.sh`:
- Around line 17-20: Three environment exports use command substitution directly
(export BOSH_ENVIRONMENT=$(get_bosh_environment), export
BOSH_CLIENT_SECRET=$(bosh-cli ...), export BOSH_CA_CERT=$(bosh-cli ...)) which
can mask failures; change each to perform the command substitution into a
variable first (e.g., BOSH_ENVIRONMENT=$(get_bosh_environment);
BOSH_CLIENT_SECRET=$(bosh-cli int ... --path /admin_password);
BOSH_CA_CERT=$(bosh-cli int ... --path /director_ssl/ca)) and then export the
variable (export BOSH_ENVIRONMENT; export BOSH_CLIENT_SECRET; export
BOSH_CA_CERT); also add a simple check after each assignment to fail fast if the
variable is empty or the command failed (exit with non‑zero and log an error) to
ensure get_bosh_environment and bosh-cli errors are not masked.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 981eadb8-e5cb-450b-affa-71562a9231f8

📥 Commits

Reviewing files that changed from the base of the PR and between f6d7148 and 049e6cf.

📒 Files selected for processing (6)
  • ci/bats/tasks/destroy-director.sh
  • ci/pipeline.yml
  • ci/tasks/deploy-zookeeper.sh
  • ci/tasks/deploy-zookeeper/zookeeper-manifest.yml
  • ci/tasks/wait-for-agents.sh
  • ci/tasks/wait-for-agents.yml


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (1)
ci/tasks/wait-for-agents.sh (1)

17-20: ⚠️ Potential issue | 🟠 Major

Separate variable declaration from export to avoid masking failures.

Lines 17, 19, and 20 use export VAR=$(...) which masks command substitution failures. If get_bosh_environment or bosh-cli int fails, export still succeeds, hiding the error despite set -e. This was flagged in a previous review and remains unaddressed.

🛡️ Proposed fix
-export BOSH_ENVIRONMENT=$(get_bosh_environment)
+BOSH_ENVIRONMENT="$(get_bosh_environment)"
+export BOSH_ENVIRONMENT
 export BOSH_CLIENT=admin
-export BOSH_CLIENT_SECRET=$(bosh-cli int director-state/director-creds.yml --path /admin_password)
-export BOSH_CA_CERT=$(bosh-cli int director-state/director-creds.yml --path /director_ssl/ca)
+BOSH_CLIENT_SECRET="$(bosh-cli int director-state/director-creds.yml --path /admin_password)"
+export BOSH_CLIENT_SECRET
+BOSH_CA_CERT="$(bosh-cli int director-state/director-creds.yml --path /director_ssl/ca)"
+export BOSH_CA_CERT
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci/tasks/wait-for-agents.sh` around lines 17 - 20, The current exports inline
command substitutions (export BOSH_ENVIRONMENT=$(get_bosh_environment), export
BOSH_CLIENT_SECRET=$(bosh-cli int ...), export BOSH_CA_CERT=$(bosh-cli int ...))
which can mask failures; change each to a two-step sequence: first assign the
result to the variable (e.g., BOSH_ENVIRONMENT=$(get_bosh_environment);
BOSH_CLIENT_SECRET=$(bosh-cli int director-state/director-creds.yml --path
/admin_password); BOSH_CA_CERT=$(bosh-cli int director-state/director-creds.yml
--path /director_ssl/ca)), then export them (export BOSH_ENVIRONMENT
BOSH_CLIENT_SECRET BOSH_CA_CERT), and optionally assert non-empty values after
assignment (fail with a clear error if get_bosh_environment or bosh-cli int
returned empty) to ensure command substitution failures are not hidden.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ci/tasks/deploy-zookeeper.sh`:
- Around line 31-55: The deploy script uses environment variables
DEPLOY_EXTRA_ARGS, MAX_DEPLOY_ATTEMPTS, and DEPLOY_RETRY_DELAY with defaults but
those params are not declared in the task YAML, so they can never be overridden;
either add them to the task's params block (declare DEPLOY_EXTRA_ARGS,
MAX_DEPLOY_ATTEMPTS with default 3, and DEPLOY_RETRY_DELAY with default 60) so
the pipeline can pass values, or remove the parameterization and hardcode the
intended defaults directly in the script and stop referencing DEPLOY_EXTRA_ARGS
/ MAX_DEPLOY_ATTEMPTS / DEPLOY_RETRY_DELAY as external parameters.
- Line 37: The unquoted ${DEPLOY_EXTRA_ARGS:-} in the bosh-cli deploy command is
intentionally word-split to allow multiple flags; add a short inline comment
above the bosh-cli -d zookeeper deploy line stating this is deliberate and that
multiple args (e.g., --skip-drain --fix) are expected, and also add a ShellCheck
suppression for SC2086 (e.g., a "# shellcheck disable=SC2086" comment) so the
intent is clear to future readers; reference DEPLOY_EXTRA_ARGS and the "bosh-cli
-d zookeeper deploy" invocation when adding the comment and suppression.

In `@ci/tasks/wait-for-agents.yml`:
- Around line 1-14: Remove the unused declarations: delete the params entry
"CPI" and the inputs entry "bosh-deployment" from the task header since
wait-for-agents.sh only consumes "director-state", "bosh-ci", and "bosh-cli";
alternatively, if they are intentionally reserved for future use, add a brief
comment in the task explaining why "CPI" and "bosh-deployment" are declared but
unused so reviewers won’t be confused (refer to the task run entry referencing
wait-for-agents.sh and the input names "director-state", "bosh-ci", "bosh-cli"
when locating the affected section).

---

Duplicate comments:
In `@ci/tasks/wait-for-agents.sh`:
- Around line 17-20: The current exports inline command substitutions (export
BOSH_ENVIRONMENT=$(get_bosh_environment), export BOSH_CLIENT_SECRET=$(bosh-cli
int ...), export BOSH_CA_CERT=$(bosh-cli int ...)) which can mask failures;
change each to a two-step sequence: first assign the result to the variable
(e.g., BOSH_ENVIRONMENT=$(get_bosh_environment); BOSH_CLIENT_SECRET=$(bosh-cli
int director-state/director-creds.yml --path /admin_password);
BOSH_CA_CERT=$(bosh-cli int director-state/director-creds.yml --path
/director_ssl/ca)), then export them (export BOSH_ENVIRONMENT BOSH_CLIENT_SECRET
BOSH_CA_CERT), and optionally assert non-empty values after assignment (fail
with a clear error if get_bosh_environment or bosh-cli int returned empty) to
ensure command substitution failures are not hidden.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: dcb20c52-55f9-4960-85b4-812a03826b7a

📥 Commits

Reviewing files that changed from the base of the PR and between 049e6cf and b94f24d.

📒 Files selected for processing (6)
  • ci/bats/tasks/destroy-director.sh
  • ci/pipeline.yml
  • ci/tasks/deploy-zookeeper.sh
  • ci/tasks/deploy-zookeeper/zookeeper-manifest.yml
  • ci/tasks/wait-for-agents.sh
  • ci/tasks/wait-for-agents.yml

@aramprice aramprice force-pushed the upgrade-spec-improvements branch 2 times, most recently from 6b6b1fe to d9951f1 Compare April 1, 2026 16:35

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (1)
ci/tasks/wait-for-agents.yml (1)

4-8: 🧹 Nitpick | 🔵 Trivial

Unused bosh-deployment input.

The wait-for-agents.sh script only uses director-state, bosh-ci, and bosh-cli. The bosh-deployment input on line 8 is not consumed by the script. Consider removing it to reduce confusion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci/tasks/wait-for-agents.yml` around lines 4 - 8, Remove the unused
`bosh-deployment` input from the inputs list (the entry named "bosh-deployment")
in the CI task YAML since the `wait-for-agents.sh` script only consumes
`director-state`, `bosh-ci`, and `bosh-cli`; update the inputs block to
eliminate that line and run a quick grep for "bosh-deployment" and review the
`wait-for-agents.sh` invocation to ensure no other code expects that input
before committing.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ci/pipeline.yml`:
- Around line 913-917: The pipeline step for the wait-for-agents task passes an
unused parameter "CPI: gcp"; remove the unused param from the wait-for-agents
task invocation (the block with "- task: wait-for-agents" and params) so the job
no longer forwards CPI, or if the task actually requires it, update the task
implementation in bosh-ci/ci/tasks/wait-for-agents.yml to consume the CPI param;
specifically edit the wait-for-agents task call to drop the CPI entry (or add
handling for CPI in the wait-for-agents task code) to eliminate the unused
parameter warning.
- Around line 807-811: The pipeline passes an unused CPI parameter to the
wait-for-agents task; either remove the CPI: gcp param from the task declaration
or make its use explicit in the task/script: update the "wait-for-agents" task
invocation to drop the CPI param, or modify the task's script
(wait-for-agents.sh) and task template to consume $CPI (or add a short comment
in the task file explaining it's intentionally reserved for future CPI-specific
logic) so the parameter is not silently unused.

In `@ci/tasks/wait-for-agents.sh`:
- Around line 42-44: The success log in the wait loop uses $((i *
SLEEP_INTERVAL)) to report elapsed time which overstates time because the sleep
for iteration i hasn't occurred yet; update the echo to compute elapsed as $(((i
- 1) * SLEEP_INTERVAL)) or alternatively log the attempt number (i)
instead—modify the branch that checks if [ "$running" -eq "$total" ] && [
"$total" -gt 0 ] so it prints the corrected elapsed calculation using
SLEEP_INTERVAL and i (or prints "after attempt $i") rather than $((i *
SLEEP_INTERVAL)).

---

Duplicate comments:
In `@ci/tasks/wait-for-agents.yml`:
- Around line 4-8: Remove the unused `bosh-deployment` input from the inputs
list (the entry named "bosh-deployment") in the CI task YAML since the
`wait-for-agents.sh` script only consumes `director-state`, `bosh-ci`, and
`bosh-cli`; update the inputs block to eliminate that line and run a quick grep
for "bosh-deployment" and review the `wait-for-agents.sh` invocation to ensure
no other code expects that input before committing.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: a812dd79-125a-46e8-b950-e3235101861c

📥 Commits

Reviewing files that changed from the base of the PR and between b94f24d and d9951f1.

📒 Files selected for processing (7)
  • ci/bats/tasks/destroy-director.sh
  • ci/pipeline.yml
  • ci/tasks/deploy-zookeeper.sh
  • ci/tasks/deploy-zookeeper.yml
  • ci/tasks/deploy-zookeeper/zookeeper-manifest.yml
  • ci/tasks/wait-for-agents.sh
  • ci/tasks/wait-for-agents.yml

@aramprice aramprice force-pushed the upgrade-spec-improvements branch 2 times, most recently from 4d7d0a6 to 3149fdf Compare April 1, 2026 16:59
@aramprice aramprice force-pushed the upgrade-spec-improvements branch from 3149fdf to fcc2a12 Compare April 1, 2026 17:04

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (1)
ci/tasks/wait-for-agents.yml (1)

4-8: 🧹 Nitpick | 🔵 Trivial

Drop the unused bosh-deployment input.

wait-for-agents.sh only consumes director-state, bosh-ci, and bosh-cli. Keeping bosh-deployment in the task contract makes the step look coupled to a manifest it never reads.

♻️ Proposed cleanup
 inputs:
 - name: director-state
 - name: bosh-ci
 - name: bosh-cli
-- name: bosh-deployment
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci/tasks/wait-for-agents.yml` around lines 4 - 8, The task declares an unused
input "bosh-deployment" which should be removed from the task contract; open the
ci/tasks/wait-for-agents.yml task inputs block and delete the line "- name:
bosh-deployment", confirm there are no usages of "bosh-deployment" in the task
(e.g., in wait-for-agents.sh or task params), and leave only the required inputs
"director-state", "bosh-ci", and "bosh-cli".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ci/bats/tasks/destroy-director.sh`:
- Line 31: The pipeline's jq expression (.Tables[0].Rows[].name) can fail when
.Tables[0] or .Rows is missing; change the jq filter to provide a safe fallback
(e.g., use .Tables[0].Rows // [] or .Tables[0].Rows? // []) so it yields an
empty array when deployments are absent, and keep the rest of the pipeline
(xargs -n1 -I % bosh-cli -n -d % delete-deployment --force) the same; update the
jq fragment in the command that currently uses ".Tables[0].Rows[].name" to a
robust variant like ".Tables[0].Rows // [] | .[].name" (or with `?` null-safe
operator) so the script handles missing fields gracefully.
- Line 31: Update the xargs invocation that runs bosh-cli delete-deployment:
quote the xargs placeholder '%' so each deployment name is passed as a single
argument even if it contains spaces or special chars; locate the pipeline that
uses bosh-cli deployments --column name --json | jq -r ".Tables[0].Rows[].name"
| xargs -n1 -I % bosh-cli -n -d % delete-deployment --force and change the xargs
placeholder '%' to a quoted '%', ensuring the bosh-cli -n -d % delete-deployment
--force call receives the entire deployment name atomically.

In `@ci/tasks/deploy-zookeeper.sh`:
- Around line 48-55: The deploy script currently exits in the if-block when
attempt == MAX_DEPLOY_ATTEMPTS before printing the diagnostic VM state; move or
duplicate the diagnostic call so the bosh-cli -d zookeeper vms || true is
executed before exiting on the final attempt (i.e., inside the if [ "$attempt"
-eq "$MAX_DEPLOY_ATTEMPTS" ] branch, call the VM state dump then exit 1),
keeping the existing retry/wait echo and DEPLOY_RETRY_DELAY logic for non-final
attempts.

---

Duplicate comments:
In `@ci/tasks/wait-for-agents.yml`:
- Around line 4-8: The task declares an unused input "bosh-deployment" which
should be removed from the task contract; open the ci/tasks/wait-for-agents.yml
task inputs block and delete the line "- name: bosh-deployment", confirm there
are no usages of "bosh-deployment" in the task (e.g., in wait-for-agents.sh or
task params), and leave only the required inputs "director-state", "bosh-ci",
and "bosh-cli".
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 37e2a25d-7dde-4532-9d1c-e012abcbc75a

📥 Commits

Reviewing files that changed from the base of the PR and between d9951f1 and fcc2a12.

📒 Files selected for processing (7)
  • ci/bats/tasks/destroy-director.sh
  • ci/pipeline.yml
  • ci/tasks/deploy-zookeeper.sh
  • ci/tasks/deploy-zookeeper.yml
  • ci/tasks/deploy-zookeeper/zookeeper-manifest.yml
  • ci/tasks/wait-for-agents.sh
  • ci/tasks/wait-for-agents.yml

@aramprice aramprice force-pushed the upgrade-spec-improvements branch from fcc2a12 to 89538fc Compare April 1, 2026 17:24
@aramprice aramprice force-pushed the upgrade-spec-improvements branch from 89538fc to 89da57d Compare April 1, 2026 17:28
@github-project-automation github-project-automation bot moved this from Waiting for Changes | Open for Contribution to Pending Merge | Prioritized in Foundational Infrastructure Working Group Apr 1, 2026
@beyhan beyhan merged commit bd996d4 into main Apr 2, 2026
20 checks passed
@beyhan beyhan deleted the upgrade-spec-improvements branch April 2, 2026 14:51
@github-project-automation github-project-automation bot moved this from Pending Merge | Prioritized to Done in Foundational Infrastructure Working Group Apr 2, 2026
