Skip dockercfg secret wait when image-registry pods are unhealthy #30782
weinliu wants to merge 1 commit into openshift:main from
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: weinliu. The full list of commands accepted by this bot can be found here. Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
Scheduling required tests: |
/retest |
Hi @petr-muller @dgoodwin, could you please take a look at this PR? It fixes a blocking issue (WINC-1578) where all |
I don't understand (and the description does not explain) why it is acceptable/expected for a debug/development cluster (e.g., Prow CI debug-winc-* jobs) to have a broken service account token controller, and therefore why the test suite would need to tolerate it.
Walkthrough
Adds a guard condition to image-registry health validation on Windows-based clusters within test utilities. When ImageRegistry is enabled, the code checks for at least one healthy pod and conditionally clears configuration fields if none exist and Windows nodes are present.
Changes
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (inconclusive)
Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.5.0) — Error: can't load config: unsupported version of the configuration: "". See https://golangci-lint.run/docs/product/migration-guide for migration instructions.
force-pushed from b2f67cb to 5e7a37d
…ows clusters When image-registry pods exist but none are Running and Ready, skip the dockercfg secret wait only on Windows (WINC) clusters. This targets debug-winc-vsphere Prow CI jobs where the SA token controller is known to be broken/disabled, causing dockercfg secrets to never be created. On non-Windows clusters, the normal wait proceeds unchanged so real infrastructure failures are not silently ignored. Fixes: https://issues.redhat.com/browse/WINC-1578
force-pushed from c6ff4b7 to 5cf2547
🧹 Nitpick comments (1)
test/extended/util/client.go (1)
432-453: Add debug logging for pod/node list failures to improve triage. At Line 432 and Line 452, list errors are silently ignored. A small log here would make timeout investigations much easier when the skip path doesn't activate.
Proposed diff
```diff
-	if podErr == nil && len(pods.Items) > 0 {
+	if podErr != nil {
+		framework.Logf("Unable to list openshift-image-registry pods for health check: %v", podErr)
+	} else if len(pods.Items) > 0 {
 		hasHealthyPod := false
 		for _, pod := range pods.Items {
 			if pod.Status.Phase == corev1.PodRunning {
@@
-	if winErr == nil && len(windowsNodes.Items) > 0 {
+	if winErr != nil {
+		framework.Logf("Unable to list Windows nodes for registry health guard: %v", winErr)
+	} else if len(windowsNodes.Items) > 0 {
 		framework.Logf("Windows cluster with unhealthy image-registry pods, skipping dockercfg secret check")
 		DefaultServiceAccounts = []string{}
 		defaultRoleBindings = []string{}
 	}
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/extended/util/client.go` around lines 432 - 453, The code currently ignores errors from listing pods and nodes (podErr and winErr) which makes debugging timeouts hard; update the health-check block around HasHealthyPod in the function using c.AdminKubeClient() so that when podErr != nil you log the error (via framework.Logf or process-equivalent) including context (e.g., which namespace/selector/pods list failed) and when winErr != nil you also log the nodes list error before relying on windowsNodes.Items; reference the variables podErr, pods, windowsNodes, winErr and the calls to c.AdminKubeClient().CoreV1().Pods().List / Nodes().List so the added messages clearly state the failed call and error string to aid triage.
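The review's suggestion amounts to a log-and-continue pattern: surface list errors for triage, but don't let them fail the check. A minimal sketch of that pattern, where `listPods` is a hypothetical stand-in for the real `c.AdminKubeClient().CoreV1().Pods(ns).List` call and `log.Printf` substitutes for `framework.Logf`:

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// listPods is a hypothetical helper standing in for the client-go call;
// the broken flag simulates an API-server list failure.
func listPods(broken bool) ([]string, error) {
	if broken {
		return nil, errors.New("connection refused")
	}
	return []string{"image-registry-abc"}, nil
}

// checkRegistry applies log-and-continue: a failed list is logged for
// triage instead of being silently ignored, and the health check is
// simply treated as inconclusive (returns false).
func checkRegistry(broken bool) bool {
	pods, err := listPods(broken)
	if err != nil {
		log.Printf("Unable to list openshift-image-registry pods for health check: %v", err)
		return false
	}
	return len(pods) > 0
}

func main() {
	fmt.Println(checkRegistry(false)) // pods listed successfully
	fmt.Println(checkRegistry(true))  // list failed; error is logged, check skipped
}
```

The point of the pattern is that the error string lands in the test logs, so a timeout investigation can tell "no unhealthy pods found" apart from "couldn't list pods at all".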
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a4fed2fd-ff5e-463e-91b4-aa860c1a1992
📒 Files selected for processing (1)
test/extended/util/client.go
@petr-muller Thanks for the review. I've updated the fix to be more targeted. On these clusters, image-registry pods exist but are unhealthy because the SA token controller is disabled. dockercfg secrets are never created, causing a 3-minute timeout per SA in setupProject(). The updated fix adds an additional check: it only skips the dockercfg wait when Windows nodes (kubernetes.io/os=windows) are present. This scopes the behavior exclusively to Windows (WINC) clusters. On non-Windows clusters, the normal wait proceeds unchanged so real infrastructure failures are not silently ignored.
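The skip decision described above can be modeled as a pure function over simplified stand-ins for the Kubernetes API types (`pod`, `node`, and `shouldSkipDockercfgWait` are illustrative names, not the real corev1 structs or the PR's actual identifiers):

```go
package main

import "fmt"

// pod is a simplified stand-in for corev1.Pod status fields.
type pod struct {
	Phase string
	Ready bool
}

// node is a simplified stand-in for corev1.Node.
type node struct {
	Labels map[string]string
}

// hasHealthyPod reports whether at least one pod is Running and Ready.
func hasHealthyPod(pods []pod) bool {
	for _, p := range pods {
		if p.Phase == "Running" && p.Ready {
			return true
		}
	}
	return false
}

// hasWindowsNode reports whether any node carries kubernetes.io/os=windows.
func hasWindowsNode(nodes []node) bool {
	for _, n := range nodes {
		if n.Labels["kubernetes.io/os"] == "windows" {
			return true
		}
	}
	return false
}

// shouldSkipDockercfgWait applies the PR's rule: skip only when registry
// pods exist, none of them are healthy, AND the cluster has Windows nodes.
func shouldSkipDockercfgWait(pods []pod, nodes []node) bool {
	return len(pods) > 0 && !hasHealthyPod(pods) && hasWindowsNode(nodes)
}

func main() {
	pods := []pod{{Phase: "Pending", Ready: false}}
	nodes := []node{{Labels: map[string]string{"kubernetes.io/os": "windows"}}}
	fmt.Println(shouldSkipDockercfgWait(pods, nodes)) // true: broken WINC cluster
}
```

Note how the three conjuncts encode the scoping argument: a non-Windows cluster with unhealthy registry pods still takes the normal wait, so a real infrastructure failure surfaces as a timeout rather than being masked.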
Scheduling required tests: |
@weinliu: all tests passed! Full PR test history. Your PR dashboard. Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
/test debug-winc-aws-ipi |
@weinliu: The specified target(s) for The following commands are available to trigger optional jobs: Use Details: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
@petr-muller Adding CI evidence to support the above. I ran Every test hits: This confirms the SA token controller is non-functional on these debug clusters, causing dockercfg secrets to never be created. The fix in this PR targets exactly this scenario, scoped to Windows clusters only. |
Summary
Fixes WINC-1578
On debug/development clusters (e.g., Prow CI `debug-winc-*` jobs), the service account token controller may be broken, which means dockercfg secrets are never created. Even though the ImageRegistry capability reports as enabled, the image-registry pods are not Running and Ready. This causes `setupProject()` to wait 3 minutes per service account for dockercfg secrets that will never appear, then fail with a timeout. All 4 `debug-winc-*` Prow CI jobs (AWS, Azure, GCP, vSphere) have been continuously failing due to this issue; every single test case times out at 3m5s.
Root Cause Analysis
1. `compat_otp.NewCLIWithoutNamespace("default")` calls origin's `exutil.NewCLI()`, which registers `setupProject()` in `BeforeEach`
2. `setupProject()` checks `IsCapabilityEnabled(ImageRegistry)` → returns `true` (capability is enabled)
3. It then calls `WaitForServiceAccountWithSecret()` for the "default" and "builder" SAs
4. `WaitForServiceAccountWithSecret()` polls for 3 minutes waiting for a `-dockercfg-` secret in `sa.ImagePullSecrets`
5. `NewCLI`/`NewCLIWithoutNamespace` fails after 3m5s
The problem is that `setupProject()` only checks whether the ImageRegistry capability is enabled, but does not verify whether the image-registry pods are actually healthy.
Changes
Added a pod health check in `setupProject()` after the ImageRegistry capability check: it lists pods in the `openshift-image-registry` namespace with the label `docker-registry=default`.
Impact
Evidence
Prow CI failure logs from `openshift-tests-private` PR #29169: `timed out waiting for the condition (3m5s)` at `client.go:424` in `pull-ci-openshift-openshift-tests-private-main-debug-winc-vsphere-ipi`.
Note: Direct verification via `openshift-tests-private` Prow CI is not possible because the `debug-winc-*` test steps use the `tests-private` image from the release payload (not `tests-private-pr` built from the PR), so code changes in `openshift-tests-private` PRs do not affect these test runs.
Test plan
Summary by CodeRabbit