
@ShazaAldawamneh

Rate-limit RequiredInstallerResourcesMissing events in InstallerController

During SNO upgrades, the installer would emit a flood of
RequiredInstallerResourcesMissing events for transient missing secrets
and configmaps, causing the [bz-etcd] pathological test to fail.

This patch adds a 30-second rate limit per unique set of missing resources:

  • Events for the same missing set are emitted at most once every 30 seconds.
  • Aggregated errors are still returned to trigger retries.
  • InstallerController.now() supplies the timestamps, which keeps the logic testable.
  • A lastMissingEvent map tracks the last emission time per missing resource set.

This prevents event spam while preserving the retry logic, and transient
missing resources are still detected correctly.

…uring SNO upgrades

Signed-off-by: Shaza Aldawamneh <shaza.aldawamneh@hotmail.com>
@openshift-ci-robot added the following labels on Oct 23, 2025:

  • jira/severity-critical: Referenced Jira bug's severity is critical for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • jira/invalid-bug: Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.
@openshift-ci-robot

@ShazaAldawamneh: This pull request references Jira Issue OCPBUGS-39241, which is invalid:

  • expected the bug to target the "4.18.z" version, but no target version was set
  • release note text must be set and not match the template OR release note type must be set to "Release Note Not Required". For more information you can reference the OpenShift Bug Process.
  • expected Jira Issue OCPBUGS-39241 to depend on a bug targeting a version in 4.19.0, 4.19.z and in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA), but no dependents were found

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


@openshift-ci bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Oct 23, 2025
@openshift-ci
Contributor

openshift-ci bot commented Oct 23, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ShazaAldawamneh marked this pull request as ready for review on October 23, 2025, 14:16
@openshift-ci
Contributor

openshift-ci bot commented Oct 23, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ShazaAldawamneh
Once this PR has been reviewed and has the lgtm label, please assign p0lyn0mial for approval. For more information see the Code Review Process.


Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci
Contributor

openshift-ci bot commented Oct 23, 2025

@ShazaAldawamneh: all tests passed!


@ShazaAldawamneh changed the title from "[WIP]: OCPBUGS-39241: Rate-limit RequiredInstallerResourcesMissing events in InstallerController to prevent SNO upgrade test failures" to "OCPBUGS-39241: Rate-limit RequiredInstallerResourcesMissing events in InstallerController to prevent SNO upgrade test failures" on Oct 24, 2025
@openshift-ci bot removed the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Oct 24, 2025

@everettraven left a comment

Is the event creation frequency the problem here, or is it a symptom of another issue?

The bug says we see this on upgrades from 4.17 -> 4.18. Are we seeing this on any other releases/upgrades?

I'm skeptical of rate limiting event creation as the right solution here.

@wangke19
Contributor

Better to have one PR to prove this.

Labels

jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

4 participants