@a-thomas-22 commented Jan 2, 2026

The critical-op PodDisruptionBudget was previously created permanently, but its selector (critical-operation=true) matched no pods during normal operation. This caused false alerts in monitoring systems such as kube-prometheus-stack, because the PDB expected healthy pods but its selector matched none.
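For illustration, here is a minimal Go sketch (standard library only) of why an idle PDB looks unhealthy. It approximates the disruption controller's arithmetic for an integer `minAvailable` and rests on two hypothetical assumptions: the critical-op PDB asks for 3 available pods, and the alert rule fires whenever desired healthy pods exceed current healthy pods. `pdbStatus`, `computeStatus`, and `wouldAlert` are illustrative names, not operator or Kubernetes code.

```go
package main

import "fmt"

// pdbStatus is a simplified stand-in for the healthy-pod counters that
// Kubernetes reports in a PodDisruptionBudget's status.
type pdbStatus struct {
	CurrentHealthy     int
	DesiredHealthy     int
	DisruptionsAllowed int
}

// computeStatus approximates the arithmetic for an integer minAvailable:
// the desired count is minAvailable itself, the current count is the number
// of healthy pods the selector matches, and allowed disruptions are the
// surplus, never below zero.
func computeStatus(minAvailable, matchedHealthyPods int) pdbStatus {
	allowed := matchedHealthyPods - minAvailable
	if allowed < 0 {
		allowed = 0
	}
	return pdbStatus{
		CurrentHealthy:     matchedHealthyPods,
		DesiredHealthy:     minAvailable,
		DisruptionsAllowed: allowed,
	}
}

// wouldAlert is a hypothetical alert condition: it fires whenever fewer
// pods are healthy than the budget asks for.
func wouldAlert(s pdbStatus) bool {
	return s.DesiredHealthy > s.CurrentHealthy
}

func main() {
	// Normal operation: no pod carries critical-operation=true.
	idle := computeStatus(3, 0)
	fmt.Printf("idle:      %+v alert=%v\n", idle, wouldAlert(idle))

	// During a critical operation: all three pods are labeled.
	upgrading := computeStatus(3, 3)
	fmt.Printf("upgrading: %+v alert=%v\n", upgrading, wouldAlert(upgrading))
}
```

With no labeled pods, the sketch reports zero of three desired pods healthy, which is the state such an alert rule flags. Deleting the PDB outside critical operations removes that state entirely.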

Changes:

  • Modified syncCriticalOpPodDisruptionBudget to check if any pods have the critical-operation label before creating/keeping the PDB
  • PDB is now created on-demand when pods are labeled (e.g., during major version upgrades) and deleted when labels are removed
  • Updated majorVersionUpgrade to explicitly create/delete the PDB around the critical operation for immediate protection
  • Removed automatic critical-op PDB creation from initial cluster setup
  • Added test to verify on-demand PDB creation and deletion behavior
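The on-demand rule in the list above can be sketched as follows. This is a standard-library model under stated assumptions: the pod slice stands in for a label-selector query against the API server, `fakeCluster` stands in for the PDB API, and `syncCriticalOpPDBSketch` is a hypothetical name, not the operator's actual `syncCriticalOpPodDisruptionBudget`.

```go
package main

import "fmt"

// criticalOpLabel is the label key matched by the PDB selector
// (critical-operation=true, per the PR description).
const criticalOpLabel = "critical-operation"

// Pod is a pared-down stand-in for a Kubernetes pod; only labels matter here.
type Pod struct {
	Name   string
	Labels map[string]string
}

// fakeCluster is a hypothetical in-memory stand-in for the Kubernetes API.
// It tracks whether the critical-op PDB exists and how often it was changed.
type fakeCluster struct {
	pdbExists        bool
	creates, deletes int
}

// hasCriticalOpPods reports whether any pod carries critical-operation=true.
func hasCriticalOpPods(pods []Pod) bool {
	for _, p := range pods {
		if p.Labels[criticalOpLabel] == "true" {
			return true
		}
	}
	return false
}

// syncCriticalOpPDBSketch keeps a PDB only while labeled pods exist.
// Repeated calls with the same pods change nothing, so it is safe to run
// on every sync loop.
func syncCriticalOpPDBSketch(c *fakeCluster, pods []Pod) {
	needed := hasCriticalOpPods(pods)
	switch {
	case needed && !c.pdbExists:
		c.pdbExists = true
		c.creates++
	case !needed && c.pdbExists:
		c.pdbExists = false
		c.deletes++
	}
}

func report(stage string, c *fakeCluster) {
	fmt.Printf("%-10s pdb=%v creates=%d deletes=%d\n", stage, c.pdbExists, c.creates, c.deletes)
}

func main() {
	c := &fakeCluster{}
	pods := []Pod{
		{Name: "demo-0", Labels: map[string]string{}}, // hypothetical pod names
		{Name: "demo-1", Labels: map[string]string{}},
	}

	// Normal operation: no labels, so no PDB is created.
	syncCriticalOpPDBSketch(c, pods)
	report("idle", c)

	// Critical operation starts: pods are labeled and the PDB appears once.
	for i := range pods {
		pods[i].Labels[criticalOpLabel] = "true"
	}
	syncCriticalOpPDBSketch(c, pods)
	syncCriticalOpPDBSketch(c, pods) // second pass is a no-op
	report("labeled", c)

	// Critical operation ends: labels are removed and the PDB is deleted.
	for i := range pods {
		delete(pods[i].Labels, criticalOpLabel)
	}
	syncCriticalOpPDBSketch(c, pods)
	report("unlabeled", c)
}
```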

The explicit PDB creation in majorVersionUpgrade ensures immediate protection before the critical operation starts. The sync function serves as a safety net for edge cases like bootstrap (where Patroni applies labels) or operator restarts during critical operations.

Fixes #3020

@zalando-robot

Cannot start a pipeline due to:

No accountable user for this pipeline: no Zalando employee associated to this GitHub username

Click on pipeline status check Details link below for more information.

The critical-op PodDisruptionBudget was previously created permanently,
but its selector (critical-operation=true) matched no pods during normal
operation. This caused false alerts in monitoring systems like
kube-prometheus-stack because the PDB expected healthy pods but none
matched.

Changes:
- Modified syncCriticalOpPodDisruptionBudget to check if any pods have
  the critical-operation label before creating/keeping the PDB
- PDB is now created on-demand when pods are labeled (e.g., during
  major version upgrades) and deleted when labels are removed
- Updated majorVersionUpgrade to explicitly create/delete the PDB
  around the critical operation for immediate protection
- Removed automatic critical-op PDB creation from initial cluster setup
- Added test to verify on-demand PDB creation and deletion behavior,
  including edge cases for idempotent create/delete operations

The explicit PDB creation in majorVersionUpgrade ensures immediate
protection before the critical operation starts. The sync function
serves as a safety net for edge cases like bootstrap (where Patroni
applies labels) or operator restarts during critical operations.

Fixes zalando#3020

@a-thomas-22 force-pushed the fix/critical-op-pdb-on-demand branch from 1caf79b to 513291c on January 2, 2026 21:05

@a-thomas-22 marked this pull request as ready for review on January 2, 2026 21:08
@FxKu added the minor label on January 8, 2026
@FxKu added this to the 1.15.2 milestone on January 8, 2026
@FxKu moved this to Waiting for review in Postgres Operator on January 8, 2026
@FxKu (Member) commented Jan 9, 2026

Thanks for your contribution. We did not anticipate that such a PDB could cause these issues. We thought it was smart to keep it in place and opt pods in and out of it when needed 😃

Unit tests are currently failing. Can you fix them, please?


Labels: minor

Projects: Postgres Operator (Status: Waiting for review)

Development: successfully merging this pull request may close the issue "Newly introduced critical-op PDB causes tons of alerts with kube-prometheus-stack".

3 participants