CORENET-6605: add retry logic for transient api server errors #2862
base: master
Conversation
Walkthrough: Added a retry-with-backoff wrapper around the Patch call in ApplyObject: a new backoff config (Steps=6, Duration=5s, Factor=1.0, Jitter=0.1), a retry.OnError loop with per-attempt logging and a final failure log, and the imports needed for the retry/backoff helpers.
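To make the described change concrete, here is a minimal, self-contained sketch of the retry shape the walkthrough describes. `applyWithRetry` and `doPatch` are hypothetical stand-ins for the real Patch call inside `ApplyObject`; the backoff values and log messages mirror the ones listed in this review, not the actual file contents.

```go
package main

import (
	"errors"
	"log"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/retry"
)

// applyWithRetry wraps a single apply operation in the retry/backoff policy
// described in the walkthrough: up to 6 attempts, 5s apart (Factor 1.0), with jitter.
func applyWithRetry(objDesc string, doPatch func() error) error {
	backoff := wait.Backoff{
		Steps:    6,
		Duration: 5 * time.Second,
		Factor:   1.0,
		Jitter:   0.1,
	}
	attempt := 0
	err := retry.OnError(backoff, func(error) bool { return true }, func() error {
		attempt++
		if err := doPatch(); err != nil {
			// per-attempt logging, as described above
			log.Printf("Error applying %s (attempt %d/%d): %v", objDesc, attempt, backoff.Steps, err)
			return err
		}
		return nil
	})
	if err != nil {
		// final failure log once the backoff is exhausted
		log.Printf("Failed to apply %s after %d attempts", objDesc, backoff.Steps)
	}
	return err
}

func main() {
	err := applyWithRetry("example object", func() error { return errors.New("transient failure") })
	log.Printf("final result: %v", err)
}
```

Note that running this sketch really sleeps between attempts, so it takes roughly half a minute to exhaust the backoff.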
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
📜 Recent review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.
🔧 golangci-lint (2.5.0): Error: can't load config: unsupported version of the configuration: "". See https://golangci-lint.run/docs/product/migration-guide for migration instructions.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jluhrsen. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/apply/apply.go (1)
130-135: Consider exponential backoff for faster recovery.

With Factor: 1.0, all retries wait a constant 5 seconds. For transient issues that resolve quickly (e.g., brief network blips), exponential backoff starting shorter would recover faster:

 var backoff = wait.Backoff{
 	Steps:    6,
-	Duration: 5 * time.Second,
-	Factor:   1.0,
+	Duration: 1 * time.Second,
+	Factor:   2.0,
 	Jitter:   0.1,
 }

This retries at roughly 1s, 2s, 4s, 8s, 16s, 32s intervals, giving faster initial recovery while still reaching a similar total wait time.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (1)
pkg/apply/apply.go (3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
pkg/apply/apply.go
err = retry.OnError(backoff, func(error) bool { return true }, func() error {
	attempt++
	_, err := clusterClient.Dynamic().Resource(rm.Resource).Namespace(namespace).Patch(ctx, name, types.ApplyPatchType, data, patchOptions, subresources...)
	if err != nil {
		log.Printf("Error applying %s (attempt %d/%d): %v", objDesc, attempt, backoff.Steps, err)
	}
	return err
Retry predicate should filter for transient errors only.
The current predicate func(error) bool { return true } retries all errors unconditionally, including permanent failures (Forbidden, NotFound, Invalid) and context cancellation. This contradicts the PR objective of handling "transient API server errors" and will:
- Waste ~30 seconds retrying errors that will never succeed
- Delay graceful shutdown when context is cancelled
- Retry authorization failures that require intervention
🔎 Proposed fix: filter for transient/server errors
- err = retry.OnError(backoff, func(error) bool { return true }, func() error {
+ err = retry.OnError(backoff, func(err error) bool {
+ // Only retry on transient server-side errors, not permanent client errors
+ return apierrors.IsServerTimeout(err) ||
+ apierrors.IsServiceUnavailable(err) ||
+ apierrors.IsTooManyRequests(err) ||
+ apierrors.IsTimeout(err) ||
+ apierrors.IsInternalError(err)
+ }, func() error {
attempt++
_, err := clusterClient.Dynamic().Resource(rm.Resource).Namespace(namespace).Patch(ctx, name, types.ApplyPatchType, data, patchOptions, subresources...)
if err != nil {
log.Printf("Error applying %s (attempt %d/%d): %v", objDesc, attempt, backoff.Steps, err)
}
return err
})

Alternatively, consider retry.DefaultBackoff or retry.DefaultRetry, which provide battle-tested defaults, or use apierrors.IsRetryableError() if available in your k8s client-go version.
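For illustration, a small self-contained sketch of that alternative: reuse client-go's predefined retry.DefaultBackoff together with the transient-error predicate proposed above. Here `doPatch` is only a placeholder for the real dynamic-client Patch call.

```go
package main

import (
	"fmt"
	"log"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/client-go/util/retry"
)

func main() {
	// Placeholder operation standing in for the dynamic-client Patch call.
	doPatch := func() error { return fmt.Errorf("placeholder error") }

	// retry.DefaultBackoff is a predefined wait.Backoff shipped with client-go.
	err := retry.OnError(retry.DefaultBackoff, func(err error) bool {
		// Retry only on transient server-side errors, mirroring the proposed predicate.
		return apierrors.IsServerTimeout(err) ||
			apierrors.IsServiceUnavailable(err) ||
			apierrors.IsTooManyRequests(err) ||
			apierrors.IsTimeout(err) ||
			apierrors.IsInternalError(err)
	}, doPatch)
	log.Printf("result: %v", err)
}
```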
the errors I'm trying to retry on are things like "connection timeout", "connection refused", etc.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Search for existing retry logic and error handling patterns
echo "=== Searching for existing retry patterns ==="
rg -n -C3 "retry\.(OnError|RetryOnConflict)" --type=go
echo -e "\n=== Searching for connection error handling ==="
rg -n -C3 "connection (refused|timeout|reset)" --type=go -i
echo -e "\n=== Searching for net.Error usage ==="
rg -n -C3 "net\.Error" --type=go
echo -e "\n=== Searching for IsTransient or IsRetryable patterns ==="
rg -n -C3 "Is(Transient|Retryable)" --type=go
Length of output: 9922
@jluhrsen Thank you for clarifying! Connection-level errors like "connection refused" and "connection timeout" occur during API server rollouts before any HTTP response is received, so they won't be caught by the standard apierrors.Is*() functions I suggested.
For your use case, you'll want to retry both connection errors AND server errors, while still filtering out permanent client errors. Here's a targeted predicate:
err = retry.OnError(backoff, func(err error) bool {
// Retry on server-side HTTP errors
if apierrors.IsServerTimeout(err) ||
apierrors.IsServiceUnavailable(err) ||
apierrors.IsTooManyRequests(err) ||
apierrors.IsTimeout(err) ||
apierrors.IsInternalError(err) {
return true
}
// Don't retry permanent client errors
if apierrors.IsForbidden(err) ||
apierrors.IsNotFound(err) ||
apierrors.IsInvalid(err) ||
apierrors.IsMethodNotSupported(err) {
return false
}
// Retry connection errors (refused, timeout, reset, etc.)
// These manifest as error strings containing these patterns
errStr := err.Error()
return strings.Contains(errStr, "connection refused") ||
strings.Contains(errStr, "connection reset") ||
strings.Contains(errStr, "connection timed out") ||
strings.Contains(errStr, "i/o timeout") ||
strings.Contains(errStr, "net/http: request canceled")
}, func() error {
// ... existing code
})

This approach handles your API server rollout scenario while avoiding wasted retries on errors that will never succeed (like authorization failures).
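As a side note, the string-matching branch above also needs the standard strings import. If you'd rather not match on error text at all, a standard-library predicate along these lines covers the same connection-level failures. This is purely an illustrative sketch, not part of the PR; helpers in k8s.io/apimachinery/pkg/util/net such as IsConnectionRefused/IsConnectionReset may also be an option if the vendored version provides them.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"syscall"
)

// isConnectionError is a hypothetical helper that reports whether err looks like a
// transport-level failure (connection refused/reset or a network timeout), without
// inspecting the error string.
func isConnectionError(err error) bool {
	if errors.Is(err, syscall.ECONNREFUSED) || errors.Is(err, syscall.ECONNRESET) {
		return true
	}
	var netErr net.Error
	return errors.As(err, &netErr) && netErr.Timeout()
}

func main() {
	fmt.Println(isConnectionError(syscall.ECONNREFUSED))    // true
	fmt.Println(isConnectionError(errors.New("Forbidden"))) // false
}
```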
🧬 Code Graph Analysis Results
[pkg/apply/apply.go:ApplyObject retry behavior summary]
Summary of retry logic for server-side apply:
- Function: ApplyObject(...)
- Context: Uses retry.OnError to retry the Patch call against the server when transient errors occur.
- Backoff configuration:
- Steps: 6
- Duration: 5 seconds
- Factor: 1.0
- Jitter: 0.1
- Retry predicate:
- The error filter always returns true (retry on any error).
- Operation retried:
- clusterClient.Dynamic().Resource(rm.Resource).Namespace(namespace).Patch(ctx, name, types.ApplyPatchType, data, patchOptions, subresources...)
- Logging:
- On each failed attempt, logs: "Error applying %s (attempt %d/%d): %v"
- After all attempts, logs: "Failed to apply %s after %d attempts"
- Return behavior:
- If all retries fail, returns an error: "failed to apply / update %s: %w"
- If successful, proceeds to log success: "Apply / Create of %s was successful"
Relevant details:
- The predicate always retries, so timeouts, connection refusals, and similar transient errors are retried according to the backoff policy.
- The data encoded for patching is prepared prior to the retry loop; failure to encode results in an immediate error before retries.
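Gathering the return behavior listed above into one place, the tail of the apply path presumably reduces to something like the following fragment. This is a hedged reconstruction from the summary, not an excerpt of pkg/apply/apply.go; the helper name and the example values are hypothetical, used only so the snippet compiles on its own.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// finishApply mirrors the logging and return behavior described in the summary:
// wrap the terminal error with %w, or log success when the apply went through.
func finishApply(objDesc string, steps int, err error) error {
	if err != nil {
		log.Printf("Failed to apply %s after %d attempts", objDesc, steps)
		return fmt.Errorf("failed to apply / update %s: %w", objDesc, err)
	}
	log.Printf("Apply / Create of %s was successful", objDesc)
	return nil
}

func main() {
	log.Println(finishApply("example object", 6, errors.New("connection refused")))
	log.Println(finishApply("example object", 6, nil))
}
```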
@jluhrsen: This pull request references CORENET-6605 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
CNO is going Degraded on the first connection issue with the API server, but that can happen briefly on a new rollout. This is seen periodically in test cases doing a new rollout on purpose, like this one [0]. Even the test case does a retry [1] because of this.

[0] https://github.com/openshift/origin/blob/3854d32174b5e9ddaded1dfcc8a865bb28ca04ad/test/extended/networking/services.go#L26
[1] https://github.com/openshift/origin/blob/3854d32174b5e9ddaded1dfcc8a865bb28ca04ad/test/extended/networking/services.go#L57-L63

Signed-off-by: Jamo Luhrsen <jluhrsen@gmail.com>
Force-pushed from 60923fd to dd0058d.
@jluhrsen: The following tests failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.