Fix ~75% flake rate in upgrade-mysql and upgrade-postgres pipelines#2705
Conversation
📝 Walkthrough

Replaces fixed sleeps with a CPI-aware wait-for-agents task and script, adds retryable Zookeeper deploys with configurable attempts/extra args, reduces Zookeeper canaries and instance count, and forces BOSH deployment deletions with `--force` during teardown.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Pipeline as CI Pipeline
    participant WaitTask as wait-for-agents.sh
    participant BOSH_CLI as bosh-cli
    participant Director as BOSH Director
    participant VMs as Zookeeper VMs
    Pipeline->>WaitTask: run wait-for-agents task (inputs: director-state,...)
    WaitTask->>WaitTask: derive BOSH_ENVIRONMENT, export creds/env
    WaitTask->>BOSH_CLI: bosh-cli -d zookeeper vms --json
    BOSH_CLI->>Director: query VM state
    Director-->>BOSH_CLI: VM list (JSON)
    BOSH_CLI-->>WaitTask: JSON output
    WaitTask->>WaitTask: count total vs running via jq
    alt all present & running
        WaitTask-->>Pipeline: exit 0 (agents ready)
    else timeout or failure
        WaitTask->>BOSH_CLI: final bosh-cli -d zookeeper vms (best-effort)
        WaitTask-->>Pipeline: exit 1 (agents not ready)
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@ci/tasks/wait-for-agents.sh`:
- Around line 31-37: The vms_json variable is being assigned with stderr merged
into it (bosh-cli ... --json 2>&1), which can corrupt JSON for the subsequent jq
calls (used on vms_json) and cause jq parse failures; remove the `2>&1` so
stderr remains separate, keep the exit code check using exit_code and set -e
as-is, and when passing vms_json into jq (e.g., lines using `echo "$vms_json" |
jq -r ...`) replace the echo with `printf '%s' "$vms_json" | jq -r ...` to
safely handle special characters and line endings while preserving correct JSON
input for the jq expressions that compute total and running.
- Around line 36-42: The readiness check currently counts all VM rows from
vms_json (variables total and running) which includes non-zookeeper errands;
update the jq selectors used for total and running to first filter Rows by the
instance name starting with "zookeeper/" (e.g., select(.instance |
startswith("zookeeper/"))) so both total and running only consider zookeeper
instances when comparing counts; modify the two expressions that compute total
and running (they reference vms_json and the jq pipeline) to apply this filter
before counting.
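Taken together, the two jq fixes suggested above amount to something like the following. This is a minimal sketch, not code from the PR: `count_zookeeper_vms` is a hypothetical helper, the input is assumed to follow the usual `bosh vms --json` Tables/Rows shape, and `jq` is assumed to be on PATH.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: count rows from `bosh-cli -d zookeeper vms --json`
# output, restricted to zookeeper/* instances and optionally to a given
# process_state. JSON is fed via printf (not echo) so special characters
# and line endings survive intact.
count_zookeeper_vms() {
  local vms_json=$1 state=${2:-}
  if [ -n "$state" ]; then
    printf '%s' "$vms_json" | jq -r --arg s "$state" \
      '[.Tables[0].Rows[]
        | select(.instance | startswith("zookeeper/"))
        | select(.process_state == $s)] | length'
  else
    printf '%s' "$vms_json" | jq -r \
      '[.Tables[0].Rows[]
        | select(.instance | startswith("zookeeper/"))] | length'
  fi
}
```

With this shape, `total` and `running` both exclude errand rows like `smoke/0`, so the readiness comparison only considers zookeeper instances.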
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 0057b2ee-e938-4fff-ab15-f81ddac204c2
📒 Files selected for processing (6)
- ci/bats/tasks/destroy-director.sh
- ci/pipeline.yml
- ci/tasks/deploy-zookeeper.sh
- ci/tasks/deploy-zookeeper/zookeeper-manifest.yml
- ci/tasks/wait-for-agents.sh
- ci/tasks/wait-for-agents.yml
Force-pushed f6d7148 to 049e6cf
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@ci/tasks/wait-for-agents.sh`:
- Around line 17-20: Three environment exports use command substitution directly
(export BOSH_ENVIRONMENT=$(get_bosh_environment), export
BOSH_CLIENT_SECRET=$(bosh-cli ...), export BOSH_CA_CERT=$(bosh-cli ...)) which
can mask failures; change each to perform the command substitution into a
variable first (e.g., BOSH_ENVIRONMENT=$(get_bosh_environment);
BOSH_CLIENT_SECRET=$(bosh-cli int ... --path /admin_password);
BOSH_CA_CERT=$(bosh-cli int ... --path /director_ssl/ca)) and then export the
variable (export BOSH_ENVIRONMENT; export BOSH_CLIENT_SECRET; export
BOSH_CA_CERT); also add a simple check after each assignment to fail fast if the
variable is empty or the command failed (exit with non‑zero and log an error) to
ensure get_bosh_environment and bosh-cli errors are not masked.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 981eadb8-e5cb-450b-affa-71562a9231f8
📒 Files selected for processing (6)
- ci/bats/tasks/destroy-director.sh
- ci/pipeline.yml
- ci/tasks/deploy-zookeeper.sh
- ci/tasks/deploy-zookeeper/zookeeper-manifest.yml
- ci/tasks/wait-for-agents.sh
- ci/tasks/wait-for-agents.yml
Force-pushed 049e6cf to b94f24d
Actionable comments posted: 3
♻️ Duplicate comments (1)
ci/tasks/wait-for-agents.sh (1)
17-20: ⚠️ Potential issue | 🟠 Major

Separate variable declaration from export to avoid masking failures.

Lines 17, 19, and 20 use `export VAR=$(...)`, which masks command-substitution failures. If `get_bosh_environment` or `bosh-cli int` fails, `export` still succeeds, hiding the error despite `set -e`. This was flagged in a previous review and remains unaddressed.

🛡️ Proposed fix

```diff
-export BOSH_ENVIRONMENT=$(get_bosh_environment)
+BOSH_ENVIRONMENT="$(get_bosh_environment)"
+export BOSH_ENVIRONMENT
 export BOSH_CLIENT=admin
-export BOSH_CLIENT_SECRET=$(bosh-cli int director-state/director-creds.yml --path /admin_password)
-export BOSH_CA_CERT=$(bosh-cli int director-state/director-creds.yml --path /director_ssl/ca)
+BOSH_CLIENT_SECRET="$(bosh-cli int director-state/director-creds.yml --path /admin_password)"
+export BOSH_CLIENT_SECRET
+BOSH_CA_CERT="$(bosh-cli int director-state/director-creds.yml --path /director_ssl/ca)"
+export BOSH_CA_CERT
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ci/tasks/wait-for-agents.sh` around lines 17 - 20, The current exports inline command substitutions (export BOSH_ENVIRONMENT=$(get_bosh_environment), export BOSH_CLIENT_SECRET=$(bosh-cli int ...), export BOSH_CA_CERT=$(bosh-cli int ...)) which can mask failures; change each to a two-step sequence: first assign the result to the variable (e.g., BOSH_ENVIRONMENT=$(get_bosh_environment); BOSH_CLIENT_SECRET=$(bosh-cli int director-state/director-creds.yml --path /admin_password); BOSH_CA_CERT=$(bosh-cli int director-state/director-creds.yml --path /director_ssl/ca)), then export them (export BOSH_ENVIRONMENT BOSH_CLIENT_SECRET BOSH_CA_CERT), and optionally assert non-empty values after assignment (fail with a clear error if get_bosh_environment or bosh-cli int returned empty) to ensure command substitution failures are not hidden.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@ci/tasks/deploy-zookeeper.sh`:
- Around line 31-55: The deploy script uses environment variables
DEPLOY_EXTRA_ARGS, MAX_DEPLOY_ATTEMPTS, and DEPLOY_RETRY_DELAY with defaults but
those params are not declared in the task YAML, so they can never be overridden;
either add them to the task's params block (declare DEPLOY_EXTRA_ARGS,
MAX_DEPLOY_ATTEMPTS with default 3, and DEPLOY_RETRY_DELAY with default 60) so
the pipeline can pass values, or remove the parameterization and hardcode the
intended defaults directly in the script and stop referencing DEPLOY_EXTRA_ARGS
/ MAX_DEPLOY_ATTEMPTS / DEPLOY_RETRY_DELAY as external parameters.
- Line 37: The unquoted ${DEPLOY_EXTRA_ARGS:-} in the bosh-cli deploy command is
intentionally word-split to allow multiple flags; add a short inline comment
above the bosh-cli -d zookeeper deploy line stating this is deliberate and that
multiple args (e.g., --skip-drain --fix) are expected, and also add a ShellCheck
suppression for SC2086 (e.g., a "# shellcheck disable=SC2086" comment) so the
intent is clear to future readers; reference DEPLOY_EXTRA_ARGS and the "bosh-cli
-d zookeeper deploy" invocation when adding the comment and suppression.
In `@ci/tasks/wait-for-agents.yml`:
- Around line 1-14: Remove the unused declarations: delete the params entry
"CPI" and the inputs entry "bosh-deployment" from the task header since
wait-for-agents.sh only consumes "director-state", "bosh-ci", and "bosh-cli";
alternatively, if they are intentionally reserved for future use, add a brief
comment in the task explaining why "CPI" and "bosh-deployment" are declared but
unused so reviewers won’t be confused (refer to the task run entry referencing
wait-for-agents.sh and the input names "director-state", "bosh-ci", "bosh-cli"
when locating the affected section).
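The intentional word-splitting called out for `DEPLOY_EXTRA_ARGS` can be annotated roughly as follows. This is a sketch: `deploy` is a stub standing in for the real `bosh-cli -d zookeeper deploy` invocation so the snippet runs anywhere.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Stub standing in for `bosh-cli -d zookeeper deploy ...`; it just reports
# how many arguments it received, to show the word-splitting effect.
deploy() {
  printf '%s\n' "$#"
}

DEPLOY_EXTRA_ARGS="--skip-drain --fix"

# Intentionally unquoted: DEPLOY_EXTRA_ARGS may hold several flags
# (e.g. --skip-drain --fix) that must word-split into separate arguments.
# shellcheck disable=SC2086
deploy ${DEPLOY_EXTRA_ARGS:-}
```

With the two flags above, the stub receives 2 arguments; quoting the expansion would collapse them into one.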
---
Duplicate comments:
In `@ci/tasks/wait-for-agents.sh`:
- Around line 17-20: The current exports inline command substitutions (export
BOSH_ENVIRONMENT=$(get_bosh_environment), export BOSH_CLIENT_SECRET=$(bosh-cli
int ...), export BOSH_CA_CERT=$(bosh-cli int ...)) which can mask failures;
change each to a two-step sequence: first assign the result to the variable
(e.g., BOSH_ENVIRONMENT=$(get_bosh_environment); BOSH_CLIENT_SECRET=$(bosh-cli
int director-state/director-creds.yml --path /admin_password);
BOSH_CA_CERT=$(bosh-cli int director-state/director-creds.yml --path
/director_ssl/ca)), then export them (export BOSH_ENVIRONMENT BOSH_CLIENT_SECRET
BOSH_CA_CERT), and optionally assert non-empty values after assignment (fail
with a clear error if get_bosh_environment or bosh-cli int returned empty) to
ensure command substitution failures are not hidden.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: dcb20c52-55f9-4960-85b4-812a03826b7a
📒 Files selected for processing (6)
- ci/bats/tasks/destroy-director.sh
- ci/pipeline.yml
- ci/tasks/deploy-zookeeper.sh
- ci/tasks/deploy-zookeeper/zookeeper-manifest.yml
- ci/tasks/wait-for-agents.sh
- ci/tasks/wait-for-agents.yml
Force-pushed 6b6b1fe to d9951f1
Actionable comments posted: 3
♻️ Duplicate comments (1)
ci/tasks/wait-for-agents.yml (1)
4-8: 🧹 Nitpick | 🔵 Trivial

Unused `bosh-deployment` input.

The `wait-for-agents.sh` script only uses `director-state`, `bosh-ci`, and `bosh-cli`. The `bosh-deployment` input on line 8 is not consumed by the script. Consider removing it to reduce confusion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ci/tasks/wait-for-agents.yml` around lines 4 - 8, Remove the unused `bosh-deployment` input from the inputs list (the entry named "bosh-deployment") in the CI task YAML since the `wait-for-agents.sh` script only consumes `director-state`, `bosh-ci`, and `bosh-cli`; update the inputs block to eliminate that line and run a quick grep for "bosh-deployment" and review the `wait-for-agents.sh` invocation to ensure no other code expects that input before committing.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@ci/pipeline.yml`:
- Around line 913-917: The pipeline step for the wait-for-agents task passes an
unused parameter "CPI: gcp"; remove the unused param from the wait-for-agents
task invocation (the block with "- task: wait-for-agents" and params) so the job
no longer forwards CPI, or if the task actually requires it, update the task
implementation in bosh-ci/ci/tasks/wait-for-agents.yml to consume the CPI param;
specifically edit the wait-for-agents task call to drop the CPI entry (or add
handling for CPI in the wait-for-agents task code) to eliminate the unused
parameter warning.
- Around line 807-811: The pipeline passes an unused CPI parameter to the
wait-for-agents task; either remove the CPI: gcp param from the task declaration
or make its use explicit in the task/script: update the "wait-for-agents" task
invocation to drop the CPI param, or modify the task's script
(wait-for-agents.sh) and task template to consume $CPI (or add a short comment
in the task file explaining it's intentionally reserved for future CPI-specific
logic) so the parameter is not silently unused.
In `@ci/tasks/wait-for-agents.sh`:
- Around line 42-44: The success log in the wait loop uses $((i *
SLEEP_INTERVAL)) to report elapsed time which overstates time because the sleep
for iteration i hasn't occurred yet; update the echo to compute elapsed as $(((i
- 1) * SLEEP_INTERVAL)) or alternatively log the attempt number (i)
instead—modify the branch that checks if [ "$running" -eq "$total" ] && [
"$total" -gt 0 ] so it prints the corrected elapsed calculation using
SLEEP_INTERVAL and i (or prints "after attempt $i") rather than $((i *
SLEEP_INTERVAL)).
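The elapsed-time off-by-one flagged above comes down to the following arithmetic. A tiny sketch, assuming `i` counts poll attempts starting at 1 and that `SLEEP_INTERVAL` seconds elapse between attempts, not before the first one.

```shell
#!/usr/bin/env bash
set -euo pipefail

SLEEP_INTERVAL=10
i=3  # suppose the readiness check succeeds on the third poll

# Only i-1 sleeps have happened before attempt i, so the elapsed time is
# (i - 1) * SLEEP_INTERVAL, not i * SLEEP_INTERVAL.
echo "agents ready after $(((i - 1) * SLEEP_INTERVAL))s"
```

Logging the attempt number (`after attempt $i`) sidesteps the arithmetic entirely, at the cost of a slightly less readable log line.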
---
Duplicate comments:
In `@ci/tasks/wait-for-agents.yml`:
- Around line 4-8: Remove the unused `bosh-deployment` input from the inputs
list (the entry named "bosh-deployment") in the CI task YAML since the
`wait-for-agents.sh` script only consumes `director-state`, `bosh-ci`, and
`bosh-cli`; update the inputs block to eliminate that line and run a quick grep
for "bosh-deployment" and review the `wait-for-agents.sh` invocation to ensure
no other code expects that input before committing.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: a812dd79-125a-46e8-b950-e3235101861c
📒 Files selected for processing (7)
- ci/bats/tasks/destroy-director.sh
- ci/pipeline.yml
- ci/tasks/deploy-zookeeper.sh
- ci/tasks/deploy-zookeeper.yml
- ci/tasks/deploy-zookeeper/zookeeper-manifest.yml
- ci/tasks/wait-for-agents.sh
- ci/tasks/wait-for-agents.yml
Force-pushed 4d7d0a6 to 3149fdf
Force-pushed 3149fdf to fcc2a12
Actionable comments posted: 3
♻️ Duplicate comments (1)
ci/tasks/wait-for-agents.yml (1)
4-8: 🧹 Nitpick | 🔵 Trivial

Drop the unused `bosh-deployment` input.

`wait-for-agents.sh` only consumes `director-state`, `bosh-ci`, and `bosh-cli`. Keeping `bosh-deployment` in the task contract makes the step look coupled to a manifest it never reads.

♻️ Proposed cleanup

```diff
 inputs:
 - name: director-state
 - name: bosh-ci
 - name: bosh-cli
-- name: bosh-deployment
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ci/tasks/wait-for-agents.yml` around lines 4 - 8, The task declares an unused input "bosh-deployment" which should be removed from the task contract; open the ci/tasks/wait-for-agents.yml task inputs block and delete the line "- name: bosh-deployment", confirm there are no usages of "bosh-deployment" in the task (e.g., in wait-for-agents.sh or task params), and leave only the required inputs "director-state", "bosh-ci", and "bosh-cli".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@ci/bats/tasks/destroy-director.sh`:
- Line 31: The pipeline's jq expression (.Tables[0].Rows[].name) can fail when
.Tables[0] or .Rows is missing; change the jq filter to provide a safe fallback
(e.g., use .Tables[0].Rows // [] or .Tables[0].Rows? // []) so it yields an
empty array when deployments are absent, and keep the rest of the pipeline
(xargs -n1 -I % bosh-cli -n -d % delete-deployment --force) the same; update the
jq fragment in the command that currently uses ".Tables[0].Rows[].name" to a
robust variant like ".Tables[0].Rows // [] | .[].name" (or with `?` null-safe
operator) so the script handles missing fields gracefully.
- Line 31: Update the xargs invocation that runs bosh-cli delete-deployment:
quote the xargs placeholder '%' so each deployment name is passed as a single
argument even if it contains spaces or special chars; locate the pipeline that
uses bosh-cli deployments --column name --json | jq -r ".Tables[0].Rows[].name"
| xargs -n1 -I % bosh-cli -n -d % delete-deployment --force and change the xargs
placeholder '%' to a quoted '%', ensuring the bosh-cli -n -d % delete-deployment
--force call receives the entire deployment name atomically.
In `@ci/tasks/deploy-zookeeper.sh`:
- Around line 48-55: The deploy script currently exits in the if-block when
attempt == MAX_DEPLOY_ATTEMPTS before printing the diagnostic VM state; move or
duplicate the diagnostic call so the bosh-cli -d zookeeper vms || true is
executed before exiting on the final attempt (i.e., inside the if [ "$attempt"
-eq "$MAX_DEPLOY_ATTEMPTS" ] branch, call the VM state dump then exit 1),
keeping the existing retry/wait echo and DEPLOY_RETRY_DELAY logic for non-final
attempts.
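The retry shape the comments above converge on looks roughly like this. A sketch under stated assumptions: `deploy` and `dump_vm_state` are stubs standing in for the real `bosh-cli -d zookeeper deploy` and `bosh-cli -d zookeeper vms` calls, and the VM-state dump runs on every failure, including the final one, before exiting.

```shell
#!/usr/bin/env bash
set -euo pipefail

MAX_DEPLOY_ATTEMPTS=${MAX_DEPLOY_ATTEMPTS:-3}
DEPLOY_RETRY_DELAY=${DEPLOY_RETRY_DELAY:-60}

deploy() { return 1; }                                # stub: always fails, for illustration
dump_vm_state() { echo "bosh-cli -d zookeeper vms"; } # stub: best-effort diagnostics

deploy_with_retries() {
  local attempt
  for ((attempt = 1; attempt <= MAX_DEPLOY_ATTEMPTS; attempt++)); do
    if deploy; then
      return 0
    fi
    echo "deploy attempt $attempt/$MAX_DEPLOY_ATTEMPTS failed" >&2
    dump_vm_state || true  # dump VM state on *every* failure, incl. the last
    if [ "$attempt" -eq "$MAX_DEPLOY_ATTEMPTS" ]; then
      return 1
    fi
    sleep "$DEPLOY_RETRY_DELAY"
  done
}
```

Moving `dump_vm_state` above the final-attempt check is the whole fix: diagnostics are captured before the non-zero exit, not skipped by it.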
---
Duplicate comments:
In `@ci/tasks/wait-for-agents.yml`:
- Around line 4-8: The task declares an unused input "bosh-deployment" which
should be removed from the task contract; open the ci/tasks/wait-for-agents.yml
task inputs block and delete the line "- name: bosh-deployment", confirm there
are no usages of "bosh-deployment" in the task (e.g., in wait-for-agents.sh or
task params), and leave only the required inputs "director-state", "bosh-ci",
and "bosh-cli".
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 37e2a25d-7dde-4532-9d1c-e012abcbc75a
📒 Files selected for processing (7)
- ci/bats/tasks/destroy-director.sh
- ci/pipeline.yml
- ci/tasks/deploy-zookeeper.sh
- ci/tasks/deploy-zookeeper.yml
- ci/tasks/deploy-zookeeper/zookeeper-manifest.yml
- ci/tasks/wait-for-agents.sh
- ci/tasks/wait-for-agents.yml
Force-pushed fcc2a12 to 89538fc
Both upgrade pipelines have been failing approximately 75% of the time
since at least 2026-03-11. Analysis of 41 failed builds across both
pipelines identified three failure categories, all stemming from BOSH
agent unresponsiveness after a director upgrade.
Examined builds since 2026-03-11:
- upgrade-mysql: 19 failures / 25 builds (76% failure rate)
- upgrade-postgres: 22 failures / 30 builds (73% failure rate)
Category 1: Agent timeout during recreate-zookeeper (34/41 failures)

After the director is upgraded (create-env), agents deployed by the
old director become unresponsive to the new director. The pipeline
had a blind `sleep 300` before attempting to redeploy zookeeper, but
agents were still not reconnected when the deploy started.
- MySQL builds: `run_script` (pre-stop) times out after 45s during
the update of the 3rd-5th zookeeper instance. 2-4 instances update
successfully before one agent fails to respond.
- Postgres builds: `get_state` times out during `Preparing deployment`
before any instance updates begin. This suggests agents take longer
to reconnect when using the internal Postgres DB.
Category 2: GCP VM type incompatibility (7/41 failures)

`pd-ssd` disk type incompatible with `c4a-standard-1` machine type.
Only affected builds 250-253 (MySQL) and 458-460 (Postgres) in
mid-March. Already resolved externally — no action needed.
Category 3: Terraform destroy cascade failure (2/41 failures)

When agent timeouts cause the deploy to fail, the ensure teardown's
`delete-deployment` also times out on the same unresponsive agents.
VMs remain attached to the subnetwork, causing Terraform destroy to
fail with `resourceInUseByAnotherResource`.
1. **Replace blind sleep with active agent health check**
(ci/tasks/wait-for-agents.{sh,yml}, ci/pipeline.yml)
New `wait-for-agents` task polls `bosh vms` every 10s for up to
600s, waiting until all zookeeper agents report `process_state:
running`. Replaces the blind `sleep 300` in both upgrade-postgres
and upgrade-mysql pipeline jobs. Fails fast with diagnostic output
if agents never reconnect.
2. **Add retry logic to zookeeper deploy**
(ci/tasks/deploy-zookeeper.sh)
Wrap `bosh deploy --recreate` in a retry loop (3 attempts, 60s
delay between retries). On failure, logs current VM state for
diagnostics. Controlled by MAX_DEPLOY_ATTEMPTS and
DEPLOY_RETRY_DELAY env vars. Also adds DEPLOY_EXTRA_ARGS to
allow passing --skip-drain from the pipeline if needed.
3. **Use --force for teardown delete-deployment**
(ci/bats/tasks/destroy-director.sh)
Add `--force` flag to `delete-deployment` in the teardown script.
This skips drain and pre-stop lifecycle hooks during cleanup,
preventing the cascade failure where teardown times out on the
same unresponsive agents that caused the original deploy failure.
4. **Reduce zookeeper instances from 5 to 3**
(ci/tasks/deploy-zookeeper/zookeeper-manifest.yml)
3 instances is the minimum ZooKeeper quorum and is sufficient to
validate the upgrade path. Fewer instances means lower probability
of hitting an unresponsive agent, and faster update cycles.
Canaries reduced from 2 to 1 accordingly.
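The agent health check in change 1 boils down to a poll loop like the one below. This is a minimal sketch, not the shipped script: `get_vms_json` is a hypothetical stand-in for `bosh-cli -d zookeeper vms --json`, `jq` is assumed on PATH, and the counts are restricted to `zookeeper/*` instances.

```shell
#!/usr/bin/env bash
set -euo pipefail

TIMEOUT=600
SLEEP_INTERVAL=10

# Hypothetical stand-in for `bosh-cli -d zookeeper vms --json`.
get_vms_json() {
  printf '%s' '{"Tables":[{"Rows":[{"instance":"zookeeper/0","process_state":"running"}]}]}'
}

wait_for_agents() {
  local i vms_json total running
  for ((i = 1; i <= TIMEOUT / SLEEP_INTERVAL; i++)); do
    vms_json=$(get_vms_json)
    total=$(printf '%s' "$vms_json" | jq -r \
      '[.Tables[0].Rows[] | select(.instance | startswith("zookeeper/"))] | length')
    running=$(printf '%s' "$vms_json" | jq -r \
      '[.Tables[0].Rows[] | select(.instance | startswith("zookeeper/"))
        | select(.process_state == "running")] | length')
    if [ "$running" -eq "$total" ] && [ "$total" -gt 0 ]; then
      # i-1 sleeps have elapsed before a success on attempt i
      echo "all $total agents running after $(((i - 1) * SLEEP_INTERVAL))s"
      return 0
    fi
    sleep "$SLEEP_INTERVAL"
  done
  echo "agents not ready after ${TIMEOUT}s" >&2
  return 1
}

wait_for_agents
```

The `"$total" -gt 0` guard matters: an empty VM table would otherwise count as "all running" and let the deploy proceed against a director with no registered agents.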
The root cause — why agents become unresponsive after a director
upgrade — remains uninvestigated. Likely candidates:
- NATS server restarts during create-env, and agents exhaust their
reconnect attempts (max_reconnect_attempts=4, reconnect_time_wait=2s
in nats_rpc.rb) before NATS comes back up
- NATS CA certificate rotation during upgrade invalidates existing
agent TLS certificates (chicken-and-egg: can't push new certs to
agents that can't connect)
- The health monitor's scan-and-fix process runs during the sleep
window and may interfere with agent state
Made-with: Cursor
Force-pushed 89538fc to 89da57d
Both upgrade pipelines have been failing approximately 75% of the time since at least 2026-03-11. Analysis of 41 failed builds across both pipelines identified three failure categories, all stemming from BOSH agent unresponsiveness after a director upgrade.
Failure Analysis
Examined builds since 2026-03-11:
Category 1: Agent timeout during recreate-zookeeper (34/41 failures)
After the director is upgraded (create-env), agents deployed by the old director become unresponsive to the new director. The pipeline had a blind `sleep 300` before attempting to redeploy zookeeper, but agents were still not reconnected when the deploy started.

- MySQL builds: `run_script` (pre-stop) times out after 45s during the update of the 3rd-5th zookeeper instance. 2-4 instances update successfully before one agent fails to respond.
- Postgres builds: `get_state` times out during `Preparing deployment` before any instance updates begin. This suggests agents take longer to reconnect when using the internal Postgres DB.

Category 2: GCP VM type incompatibility (7/41 failures)
`pd-ssd` disk type incompatible with `c4a-standard-1` machine type. Only affected builds 250-253 (MySQL) and 458-460 (Postgres) in mid-March. Already resolved externally — no action needed.

Category 3: Terraform destroy cascade failure (2/41 failures)

When agent timeouts cause the deploy to fail, the ensure teardown's `delete-deployment` also times out on the same unresponsive agents. VMs remain attached to the subnetwork, causing Terraform destroy to fail with `resourceInUseByAnotherResource`.

Changes
Replace blind sleep with active agent health check (ci/tasks/wait-for-agents.{sh,yml}, ci/pipeline.yml)
New `wait-for-agents` task polls `bosh vms` every 10s for up to 600s, waiting until all zookeeper agents report `process_state: running`. Replaces the blind `sleep 300` in both upgrade-postgres and upgrade-mysql pipeline jobs. Fails fast with diagnostic output if agents never reconnect.

Add retry logic to zookeeper deploy (ci/tasks/deploy-zookeeper.sh)
Wrap `bosh deploy --recreate` in a retry loop (3 attempts, 60s delay between retries). On failure, logs current VM state for diagnostics. Controlled by MAX_DEPLOY_ATTEMPTS and DEPLOY_RETRY_DELAY env vars. Also adds DEPLOY_EXTRA_ARGS to allow passing --skip-drain from the pipeline if needed.

Use --force for teardown delete-deployment (ci/bats/tasks/destroy-director.sh)
Add `--force` flag to `delete-deployment` in the teardown script. This skips drain and pre-stop lifecycle hooks during cleanup, preventing the cascade failure where teardown times out on the same unresponsive agents that caused the original deploy failure.

Reduce zookeeper instances from 5 to 3 (ci/tasks/deploy-zookeeper/zookeeper-manifest.yml)
3 instances is the minimum ZooKeeper quorum and is sufficient to validate the upgrade path. Fewer instances means lower probability of hitting an unresponsive agent, and faster update cycles. Canaries reduced from 2 to 1 accordingly.
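The hardened teardown line from change 3, with the null-safe jq filter and quoted xargs placeholder suggested in review, can be sketched as follows. `bosh_deployments_json` is a stub standing in for `bosh-cli deployments --column name --json`, and `echo` makes the final command a dry run so the sketch is runnable anywhere.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Stub standing in for `bosh-cli deployments --column name --json`.
bosh_deployments_json() {
  printf '%s' '{"Tables":[{"Rows":[{"name":"zookeeper"}]}]}'
}

# `// []` makes jq yield nothing (instead of erroring) when the director
# reports no deployments; the quoted '%' follows the review suggestion for
# names containing unusual characters; --force skips drain/pre-stop hooks
# on agents that are already unresponsive.
bosh_deployments_json \
  | jq -r '.Tables[0].Rows // [] | .[].name' \
  | xargs -n1 -I % echo bosh-cli -n -d '%' delete-deployment --force
```

With an empty deployments table the pipeline simply produces no delete commands, rather than failing the whole teardown.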
Further Investigation Needed
The root cause — why agents become unresponsive after a director upgrade — remains uninvestigated. Likely candidates:

- NATS server restarts during create-env, and agents exhaust their reconnect attempts (max_reconnect_attempts=4, reconnect_time_wait=2s in nats_rpc.rb) before NATS comes back up
- NATS CA certificate rotation during upgrade invalidates existing agent TLS certificates (chicken-and-egg: can't push new certs to agents that can't connect)
- The health monitor's scan-and-fix process runs during the sleep window and may interfere with agent state
Made-with: Cursor