Correct compliance on no-op dryrun updates #404

JustinKuli · 2025-10-29T19:53:25Z

Previously, the policy status would incorrectly have a noncompliant event followed immediately by a compliant event when a dryrun update would not make any changes on that object. This would cause the policy status to repeatedly update if the policy was reconciling often - for example if any of its watched objects were updating.

Now only the correct "compliant" event should be emitted in that case.

Refs:

https://issues.redhat.com/browse/ACM-25694

Previously, the policy status would incorrectly have a noncompliant event followed immediately by a compliant event when a dryrun update would not make any changes on that object. This would cause the policy status to repeatedly update if the policy was reconciling often - for example if any of its watched objects were updating. Now only the correct "compliant" event should be emitted in that case. Refs: - https://issues.redhat.com/browse/ACM-25694 Signed-off-by: Justin Kulikauskas <jkulikau@redhat.com>

JustinKuli · 2025-10-29T19:53:59Z

test/e2e/case8_status_check_test.go

+
+				history, _, err := unstructured.NestedSlice(managedPlc.Object, "status", "history")
+				g.Expect(err).NotTo(HaveOccurred())
+				g.Expect(history).To(HaveLen(1))


Without the change, the history for this policy would have 2 entries.

JustinKuli · 2025-10-29T20:32:21Z

This flake(?) happened:

  [FAILED] in [It] - /home/runner/work/config-policy-controller/config-policy-controller/test/e2e/case47_ratelimit_test.go:55 @ 10/29/25 19:57:14.973
  << Timeline

  [FAILED] Failed after 3.859s.
  The function passed to Consistently returned the following error:
      <*errors.errorString | 0xc0002ae080>: 
      failed to retrieve any config_policy_evaluation_total metric
      {
          s: "failed to retrieve any config_policy_evaluation_total metric",
      }
  At one point, however, the function did return successfully.
  Yet, Consistently failed because the matcher was not satisfied:
  Expected
      <float64>: 1
  to be <
      <int>: 4
  In [It] at: /home/runner/work/config-policy-controller/config-policy-controller/test/e2e/case47_ratelimit_test.go:55 @ 10/29/25 19:57:14.973

That's at least twice now, in a relatively new test, so it will be worth investigating soon.

dhaiducek

This is changing code logic for an enforce policy, but as far as I can tell the policy in the issue is in inform, so I'm not actually sure this would resolve it? If the object is constantly being updated, I'd expect the policy to churn, right? Or am I missing something here?

JustinKuli · 2025-10-29T21:08:19Z

This is changing code logic for an enforce policy, but as far as I can tell the policy in the issue is in inform, so I'm not actually sure this would resolve it? If the object is constantly being updated, I'd expect the policy to churn, right? Or am I missing something here?

I don't think this code change is just for enforce? The test case is using an inform policy at first, and its behavior changed...

The policy "churn" isn't that it is constantly sending actual updates to a managed object, it's only that the policy's status is flipping to noncompliant and then compliant whenever it reconciles. The bug is that the noncompliant status is incorrect: the log notes that the dry-run request shows no changes are needed, but it's reporting as "found but not as specified" anyway. It should only be reporting that the resource is compliant.

In a policy for an object that doesn't have other updates streaming in, it doesn't really churn: for example in the test, it was just emitting those two events. But if something irrelevant updates in that managed object, like an annotation or the status, that will trigger a reconcile on the policy. So previously in that case, it was re-emitting both the (incorrect) noncompliant event, and the compliant event. It should only be trying to emit the compliant event, and then stop itself from doing that, because that exactly matches the previous/current status.

Does that make sense?

dhaiducek

That makes sense--sorry, I got lost in the conditionals. So the updated block is happening regardless of remediationAction. I'm just really baffled on why this flow wouldn't be working and needs to be overridden. And I'm also concerned we might be covering up some edge case perhaps around a policy with a status check or changing existing behavior by overriding the boolean especially since this would be backported to 2.14. But also things seem to be working as expected and are passing the tests which also includes status checking, so I'm not sure I need to be overthinking it any more. 😅

I also noticed this commit, which might explain why they have annotations: {} in there? stolostron/config-policy-controller@34c02b8

openshift-ci · 2025-10-30T16:16:29Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dhaiducek, JustinKuli

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [JustinKuli,dhaiducek]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

JustinKuli · 2025-10-30T16:27:28Z

You're right to question it! And those conditions get hairy, especially in GitHub, and with my other small changes cluttering things up...

The main change is:

-			// Assume object is compliant by inverting the value of throwSpecViolation
+			// treat the object as compliant, with no updates needed
-			r.setEvaluatedObject(obj.policy, obj.existingObj, !throwSpecViolation, "")
+			r.setEvaluatedObject(obj.policy, obj.existingObj, true, "")
-			matchesAfterDryRun = true
-			return throwSpecViolation, "", diff, updateNeeded, updatedObj, matchesAfterDryRun
+			return false, "", diff, false, updatedObj, true

And I suspect the old code was incorrectly copied or trying to be clever, and we didn't notice in our specific cases because there would be another event emitted matching the status we were expecting. In particular, continuing to work with throwSpecViolation, when I think that in this situation we should just be marking things as compliant.

dhaiducek · 2025-10-30T16:31:18Z

Right, I eventually noticed the rest were no-ops since the value for matchesAfterDryRun would be false anyway.

For a future PR, it looks like throwSpecViolation and statusMismatch are identical and one of them can be removed? I prefer statusMismatch personally since throwSpecViolation doesn't make a whole lot of sense for that. 🙂

Previously, the policy status would incorrectly have a noncompliant event followed immediately by a compliant event when a dryrun update would not make any changes on that object. This would cause the policy status to repeatedly update if the policy was reconciling often - for example if any of its watched objects were updating. Now only the correct "compliant" event should be emitted in that case. NOTE: This is a backport, but not just a cherry-pick. This incorporates some of the changes from these upstream PRs, with updates for tests: - open-cluster-management-io/config-policy-controller#404 - open-cluster-management-io/config-policy-controller#377 Refs: - https://issues.redhat.com/browse/ACM-25742 Signed-off-by: Justin Kulikauskas <jkulikau@redhat.com>

openshift-ci bot added the dco-signoff: yes label Oct 29, 2025

openshift-ci bot requested review from dhaiducek and yiraeChristineKim October 29, 2025 19:53

openshift-ci bot added the approved label Oct 29, 2025

JustinKuli commented Oct 29, 2025

View reviewed changes

dhaiducek reviewed Oct 29, 2025

View reviewed changes

dhaiducek approved these changes Oct 30, 2025

View reviewed changes

openshift-ci bot assigned dhaiducek Oct 30, 2025

openshift-ci bot added the lgtm label Oct 30, 2025

openshift-merge-bot bot merged commit 3fd6da7 into open-cluster-management-io:main Oct 30, 2025
29 of 33 checks passed

magic-mirror-bot bot mentioned this pull request Oct 30, 2025

🤖 Sync from open-cluster-management-io/config-policy-controller: #404 stolostron/config-policy-controller#1546

Merged

JustinKuli mentioned this pull request Oct 30, 2025

Correct compliance on no-op dryrun updates stolostron/config-policy-controller#1547

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Correct compliance on no-op dryrun updates #404

Correct compliance on no-op dryrun updates #404

JustinKuli commented Oct 29, 2025

Uh oh!

JustinKuli Oct 29, 2025

Uh oh!

JustinKuli commented Oct 29, 2025

Uh oh!

dhaiducek left a comment

Uh oh!

JustinKuli commented Oct 29, 2025

Uh oh!

dhaiducek left a comment •

edited

Loading

Uh oh!

openshift-ci bot commented Oct 30, 2025

Uh oh!

Uh oh!

JustinKuli commented Oct 30, 2025

Uh oh!

dhaiducek commented Oct 30, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Correct compliance on no-op dryrun updates #404

Correct compliance on no-op dryrun updates #404

Conversation

JustinKuli commented Oct 29, 2025

Uh oh!

JustinKuli Oct 29, 2025

Choose a reason for hiding this comment

Uh oh!

JustinKuli commented Oct 29, 2025

Uh oh!

dhaiducek left a comment

Choose a reason for hiding this comment

Uh oh!

JustinKuli commented Oct 29, 2025

Uh oh!

dhaiducek left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

openshift-ci bot commented Oct 30, 2025

Uh oh!

Uh oh!

JustinKuli commented Oct 30, 2025

Uh oh!

dhaiducek commented Oct 30, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

dhaiducek left a comment •

edited

Loading