Conversation
There was a problem hiding this comment.
Pull request overview
This PR adds Kubernetes node cordon/drain + uncordon behavior around the drift “kubernetes upgrade” remediation flow, and introduces reusable Kubernetes clientset helpers to avoid shelling out to kubectl for some node status checks.
Changes:
- Add a new remediation step sequence for kubelet upgrades: cordon+drain → stop kubelet → download binaries → start kubelet → uncordon.
- Introduce
pkg/kubehelpers for cached kubelet clientset and AKS-admin clientset creation. - Replace kubelet readiness probing via
kubectlinvocation with a client-go Node GET.
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/status/collector.go | Switch kubelet readiness check from kubectl jsonpath to client-go Node condition inspection. |
| pkg/kube/client.go | Add client-go helpers: cached kubelet clientset + admin clientset built from AKS admin kubeconfig. |
| pkg/drift/remediation.go | Insert new “cordon-and-drain” and “uncordon” steps into Kubernetes upgrade remediation sequence. |
| pkg/drift/node_maintenance.go | Implement node maintenance operations (cordon/drain/uncordon) using k8s.io/kubectl/pkg/drain, with admin fallback. |
| pkg/drift/node_maintenance_test.go | Add unit tests for cordon/drain/uncordon orchestration and retry detection. |
| go.mod | Add k8s.io/kubectl dependency and update/introduce several indirect deps. |
| go.sum | Update sums for newly introduced/updated dependencies. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
This PR adds node cordon/drain + uncordon around drift-driven Kubernetes (kubelet/binaries) upgrades, and improves status snapshot safety/accuracy so health checks don’t react to stale upgrade state.
Changes:
- Add
cordon-and-drainanduncordonsteps to the Kubernetes upgrade remediation flow (with best-effort uncordon retry). - Introduce
pkg/drift/node_maintenance.go(client-go + kubectl drain library) to cordon/drain/uncordon, preferring admin credentials when needed. - Improve status snapshot handling: in-process file lock for writers, “mark healthy after upgrade” snapshot update, and client-go based kubelet readiness check.
Reviewed changes
Copilot reviewed 12 out of 13 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/drift/remediation.go | Adds upgrade step constants, inserts cordon/drain + uncordon steps, and updates status handling on success/failure. |
| pkg/drift/remediation_test.go | Adds tests for when upgrade failures should mark kubelet unhealthy. |
| pkg/drift/node_maintenance.go | Implements cordon/drain/uncordon via client-go + kubectl/pkg/drain, with admin fallback. |
| pkg/drift/node_maintenance_test.go | Unit tests for node maintenance executors and admin-retry detection. |
| pkg/kube/client.go | Adds cached kubelet clientset and AKS admin-kubeconfig clientset helpers. |
| pkg/status/collector.go | Switches node readiness check from kubectl invocation to client-go. |
| pkg/status/health.go | Adds “mark kubelet healthy after upgrade” and serializes status read-modify-write operations. |
| pkg/status/health_test.go | Adds coverage for “healthy after upgrade” status update behavior. |
| pkg/status/loader.go | Splits loader into unlocked helper for use under new status lock. |
| pkg/status/lock.go | Introduces in-process mutex for serializing status snapshot updates. |
| pkg/status/writer.go | Wraps status writes with the new status-file mutex. |
| go.mod / go.sum | Adds k8s.io/kubectl dependency and updates related transitive deps. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Adds node cordon/drain support to the Kubernetes upgrade drift remediation flow, and improves status/health reporting and Kubernetes API interactions to better reflect node state during/after upgrades.
Changes:
- Add
cordon-and-drainanduncordonsteps around kubelet binary upgrade remediation, including retry logic and unit tests. - Introduce in-process locking for status snapshot read/modify/write operations and add a “mark kubelet healthy after upgrade” status update (+ tests).
- Replace
kubectl get node ...readiness probing with a client-go call via a cached kubelet clientset; add shared kube client helpers (kubelet/admin).
Reviewed changes
Copilot reviewed 12 out of 13 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| pkg/status/writer.go | Wrap status writes with an in-process mutex; split unlocked write helper. |
| pkg/status/lock.go | New status file mutex helper to prevent lost updates within the agent process. |
| pkg/status/loader.go | Split unlocked load helper for use inside lock-protected sections. |
| pkg/status/health.go | Lock-protected status updates; add “mark healthy after upgrade” helper. |
| pkg/status/health_test.go | Add test ensuring “mark healthy after upgrade” preserves unrelated fields. |
| pkg/status/collector.go | Switch kubelet readiness probing to client-go Nodes().Get with timeout/constants. |
| pkg/kube/client.go | New cached kubelet clientset + admin clientset via AKS management-plane kubeconfig. |
| pkg/drift/remediation.go | Add upgrade step constants, cordon/drain + uncordon steps, and status update after successful upgrade. |
| pkg/drift/remediation_test.go | Add tests for shouldMarkKubeletUnhealthyAfterUpgradeFailure behavior. |
| pkg/drift/node_maintenance.go | New Kubernetes node maintenance implementation (cordon/drain/uncordon) with admin retry and drain helper config. |
| pkg/drift/node_maintenance_test.go | Unit tests for cordon/drain/uncordon executor behavior and admin-retry predicate. |
| go.mod | Add k8s.io/kubectl dependency (and indirect deps). |
| go.sum | Update sums for new/updated dependencies pulled in by kubectl/drain usage. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| t.Fatalf("start-kubelet failure marked unhealthy=false, want true") | ||
| } | ||
|
|
||
| // Unknown step -> conservative true. |
| Force: false, | ||
| GracePeriodSeconds: -1, | ||
| IgnoreAllDaemonSets: true, | ||
| DeleteEmptyDirData: false, |
There was a problem hiding this comment.
What does this DeleteEmptyDirData: false, mean? A pod using emptyDir volumes won't be able to get drained?
| } | ||
|
|
||
| m.mu.Lock() | ||
| m.client = cs |
There was a problem hiding this comment.
Curious would admin kubeconfig ever expire?
| return cs, nil | ||
| } | ||
|
|
||
| func fetchClusterAdminKubeconfig(ctx context.Context, cfg *config.Config) ([]byte, error) { |
There was a problem hiding this comment.
Seems duplicated with pkg/bootstrapper/cluster_config_enricher.go , is there a way to share?
| } | ||
|
|
||
| kubeletClient = cs | ||
| kubeletErr = nil |
There was a problem hiding this comment.
nit: don't think we need this variable
log
node tracking:
pod tracking: