Skip to content

Conversation

@JustinKuli
Copy link
Member

Previously, the policy status would incorrectly have a noncompliant event followed immediately by a compliant event when a dryrun update would not make any changes on that object. This would cause the policy status to repeatedly update if the policy was reconciling often - for example if any of its watched objects were updating.

Now only the correct "compliant" event should be emitted in that case.

Refs:

Previously, the policy status would incorrectly have a noncompliant
event followed immediately by a compliant event when a dryrun update
would not make any changes on that object. This would cause the policy
status to repeatedly update if the policy was reconciling often - for
example if any of its watched objects were updating.

Now only the correct "compliant" event should be emitted in that case.

Refs:
 - https://issues.redhat.com/browse/ACM-25694

Signed-off-by: Justin Kulikauskas <jkulikau@redhat.com>

history, _, err := unstructured.NestedSlice(managedPlc.Object, "status", "history")
g.Expect(err).NotTo(HaveOccurred())
g.Expect(history).To(HaveLen(1))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without the change, the history for this policy would have 2 entries.

@JustinKuli
Copy link
Member Author

This flake(?) happened:

  [FAILED] in [It] - /home/runner/work/config-policy-controller/config-policy-controller/test/e2e/case47_ratelimit_test.go:55 @ 10/29/25 19:57:14.973
  << Timeline

  [FAILED] Failed after 3.859s.
  The function passed to Consistently returned the following error:
      <*errors.errorString | 0xc0002ae080>: 
      failed to retrieve any config_policy_evaluation_total metric
      {
          s: "failed to retrieve any config_policy_evaluation_total metric",
      }
  At one point, however, the function did return successfully.
  Yet, Consistently failed because the matcher was not satisfied:
  Expected
      <float64>: 1
  to be <
      <int>: 4
  In [It] at: /home/runner/work/config-policy-controller/config-policy-controller/test/e2e/case47_ratelimit_test.go:55 @ 10/29/25 19:57:14.973

That's at least twice now, in a relatively new test, so it will be worth investigating soon.

Copy link
Member

@dhaiducek dhaiducek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is changing code logic for an enforce policy, but as far as I can tell the policy in the issue is in inform, so I'm not actually sure this would resolve it? If the object is constantly being updated, I'd expect the policy to churn, right? Or am I missing something here?

@JustinKuli
Copy link
Member Author

This is changing code logic for an enforce policy, but as far as I can tell the policy in the issue is in inform, so I'm not actually sure this would resolve it? If the object is constantly being updated, I'd expect the policy to churn, right? Or am I missing something here?

I don't think this code change is just for enforce? The test case is using an inform policy at first, and its behavior changed...

The policy "churn" isn't that it is constantly sending actual updates to a managed object, it's only that the policy's status is flipping to noncompliant and then compliant whenever it reconciles. The bug is that the noncompliant status is incorrect: the log notes that the dry-run request shows no changes are needed, but it's reporting as "found but not as specified" anyway. It should only be reporting that the resource is compliant.

In a policy for an object that doesn't have other updates streaming in, it doesn't really churn: for example in the test, it was just emitting those two events. But if something irrelevant updates in that managed object, like an annotation or the status, that will trigger a reconcile on the policy. So previously in that case, it was re-emitting both the (incorrect) noncompliant event, and the compliant event. It should only be trying to emit the compliant event, and then stop itself from doing that, because that exactly matches the previous/current status.

Does that make sense?

Copy link
Member

@dhaiducek dhaiducek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense--sorry, I got lost in the conditionals. So the updated block is happening regardless of remediationAction. I'm just really baffled on why this flow wouldn't be working and needs to be overridden. And I'm also concerned we might be covering up some edge case perhaps around a policy with a status check or changing existing behavior by overriding the boolean especially since this would be backported to 2.14. But also things seem to be working as expected and are passing the tests which also includes status checking, so I'm not sure I need to be overthinking it any more. 😅

I also noticed this commit, which might explain why they have annotations: {} in there? stolostron/config-policy-controller@34c02b8

@openshift-ci
Copy link

openshift-ci bot commented Oct 30, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dhaiducek, JustinKuli

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [JustinKuli,dhaiducek]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit 3fd6da7 into open-cluster-management-io:main Oct 30, 2025
29 of 33 checks passed
@JustinKuli
Copy link
Member Author

You're right to question it! And those conditions get hairy, especially in GitHub, and with my other small changes cluttering things up...

The main change is:

-			// Assume object is compliant by inverting the value of throwSpecViolation
+			// treat the object as compliant, with no updates needed
-			r.setEvaluatedObject(obj.policy, obj.existingObj, !throwSpecViolation, "")
+			r.setEvaluatedObject(obj.policy, obj.existingObj, true, "")
-			matchesAfterDryRun = true
-			return throwSpecViolation, "", diff, updateNeeded, updatedObj, matchesAfterDryRun
+			return false, "", diff, false, updatedObj, true

And I suspect the old code was incorrectly copied or trying to be clever, and we didn't notice in our specific cases because there would be another event emitted matching the status we were expecting. In particular, continuing to work with throwSpecViolation, when I think that in this situation we should just be marking things as compliant.

@dhaiducek
Copy link
Member

Right, I eventually noticed the rest were no-ops since the value for matchesAfterDryRun would be false anyway.

For a future PR, it looks like throwSpecViolation and statusMismatch are identical and one of them can be removed? I prefer statusMismatch personally since throwSpecViolation doesn't make a whole lot of sense for that. 🙂

JustinKuli added a commit to stolostron/config-policy-controller that referenced this pull request Oct 30, 2025
Previously, the policy status would incorrectly have a noncompliant
event followed immediately by a compliant event when a dryrun update
would not make any changes on that object. This would cause the policy
status to repeatedly update if the policy was reconciling often - for
example if any of its watched objects were updating.

Now only the correct "compliant" event should be emitted in that case.

NOTE: This is a backport, but not just a cherry-pick. This incorporates
some of the changes from these upstream PRs, with updates for tests:
- open-cluster-management-io/config-policy-controller#404
- open-cluster-management-io/config-policy-controller#377

Refs:
 - https://issues.redhat.com/browse/ACM-25742

Signed-off-by: Justin Kulikauskas <jkulikau@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants