
add cordon and drain for upgrading #129

Open
zqingqing1 wants to merge 5 commits into main from qizhe/add-cordon-drain

Conversation

@zqingqing1 (Member)

Remediation log:

```
level=warning msg="Drift detected: id=kubernetes-version title=Kubernetes version drift details=kubelet=\"1.32.7\" desired=\"1.33.2\"" func="[remediation.go:92]"
level=info msg="Starting AKS node drift-kubernetes-upgrade" func="[executor.go:68]"
level=info msg="Executing drift-kubernetes-upgrade step cordon-and-drain" func="[executor.go:123]"
level=info msg="Cordoning node free-node before kubelet upgrade" func="[node_maintenance.go:270]"
level=info msg="Draining node free-node before kubelet upgrade" func="[node_maintenance.go:279]"
level=info msg="drift-kubernetes-upgrade step: cordon-and-drain completed successfully with duration 32.873261174s" func="[executor.go:147]"
level=info msg="Executing drift-kubernetes-upgrade step stop-kubelet" func="[executor.go:123]"
level=info msg="drift-kubernetes-upgrade step: stop-kubelet completed successfully with duration 43.442382ms" func="[executor.go:147]"
level=info msg="Executing drift-kubernetes-upgrade step download-kube-binaries" func="[executor.go:123]"
level=info msg="drift-kubernetes-upgrade step: download-kube-binaries completed successfully with duration 5.755130017s" func="[executor.go:147]"
level=info msg="Executing drift-kubernetes-upgrade step start-kubelet" func="[executor.go:123]"
level=info msg="drift-kubernetes-upgrade step: start-kubelet completed successfully with duration 76.14647ms" func="[executor.go:147]"
level=info msg="Executing drift-kubernetes-upgrade step uncordon" func="[executor.go:123]"
level=info msg="Uncordoning node free-node after kubelet upgrade" func="[node_maintenance.go:334]"
level=info msg="drift-kubernetes-upgrade step: uncordon completed successfully with duration 90.40138ms" func="[executor.go:147]"
level=info msg="AKS node drift-kubernetes-upgrade completed successfully (duration: 38.839009005s, stepCount: 5)" func="[executor.go:106]"
level=info msg="drift-kubernetes-upgrade completed successfully (duration: 38.839009005s, steps: 5)" func="[remediation.go:243]"
level=info msg="Kubernetes upgrade remediation completed successfully" func="[remediation.go:131]"

level=info msg="Initial drift detection after spec collection completed successfully" func="[commands.go:195]"
```

node tracking:

```
root@free-node:/home/qizhe# kubectl get node -o wide -w
NAME                                STATUS   ROLES    AGE     VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
aks-nodepool1-27087470-vmss000000   Ready    <none>   10h     v1.33.2   172.19.0.47    <none>        Ubuntu 22.04.5 LTS   5.15.0-1102-azure   containerd://1.7.30-2
aks-nodepool1-27087470-vmss000001   Ready    <none>   10h     v1.33.2   172.19.0.18    <none>        Ubuntu 22.04.5 LTS   5.15.0-1102-azure   containerd://1.7.30-2
aks-nodepool1-27087470-vmss000002   Ready    <none>   10h     v1.33.2   172.19.0.10    <none>        Ubuntu 22.04.5 LTS   5.15.0-1102-azure   containerd://1.7.30-2
flex-node                           Ready    <none>   53d     v1.32.7   10.5.202.191   <none>        Ubuntu 24.04.3 LTS   6.8.0-106-generic   containerd://1.7.20
free-node                           Ready    <none>   6d21h   v1.33.2   172.19.0.4     <none>        Ubuntu 24.04.3 LTS   6.14.0-1012-azure   containerd://2.0.4
free-node                           Ready,SchedulingDisabled   <none>   6d22h   v1.33.2   172.19.0.4     <none>        Ubuntu 24.04.3 LTS   6.14.0-1012-azure   containerd://2.0.4
free-node                           Ready,SchedulingDisabled   <none>   6d22h   v1.33.2   172.19.0.4     <none>        Ubuntu 24.04.3 LTS   6.14.0-1012-azure   containerd://2.0.4
aks-nodepool1-27087470-vmss000000   Ready                      <none>   10h     v1.33.2   172.19.0.47    <none>        Ubuntu 22.04.5 LTS   5.15.0-1102-azure   containerd://1.7.30-2
free-node                           Ready                      <none>   6d22h   v1.33.2   172.19.0.4     <none>        Ubuntu 24.04.3 LTS   6.14.0-1012-azure   containerd://2.0.4
free-node                           Ready                      <none>   6d22h   v1.33.2   172.19.0.4     <none>        Ubuntu 24.04.3 LTS   6.14.0-1012-azure   containerd://2.0.4
free-node                           Ready                      <none>   6d22h   v1.33.2   172.19.0.4     <none>        Ubuntu 24.04.3 LTS   6.14.0-1012-azure   containerd://2.0.4
aks-nodepool1-27087470-vmss000001   Ready                      <none>   10h     v1.33.2   172.19.0.18    <none>        Ubuntu 22.04.5 LTS   5.15.0-1102-azure   containerd://1.7.30-2
free-node                           Ready                      <none>   6d22h   v1.33.2   172.19.0.4     <none>        Ubuntu 24.04.3 LTS   6.14.0-1012-azure   containerd://2.0.4
^Croot@free-node:/home/qizhe#
```

pod tracking:

```
root@free-node:/home/qizhe# kubectl get pod -o wide -w
NAME                       READY   STATUS    RESTARTS   AGE   IP           NODE        NOMINATED NODE   READINESS GATES
busybox-65bb6db647-fkzvc   1/1     Running   0          38s   10.244.0.3   free-node   <none>           <none>
busybox-65bb6db647-fkzvc   1/1     Running   0          66s   10.244.0.3   free-node   <none>           <none>
busybox-65bb6db647-fkzvc   1/1     Terminating   0          66s   10.244.0.3   free-node   <none>           <none>
busybox-65bb6db647-fkzvc   1/1     Terminating   0          66s   10.244.0.3   free-node   <none>           <none>
busybox-65bb6db647-jd6nk   0/1     Pending       0          0s    <none>       <none>      <none>           <none>
busybox-65bb6db647-jd6nk   0/1     Pending       0          0s    <none>       aks-nodepool1-27087470-vmss000000   <none>           <none>
busybox-65bb6db647-jd6nk   0/1     ContainerCreating   0          0s    <none>       aks-nodepool1-27087470-vmss000000   <none>           <none>
busybox-65bb6db647-jd6nk   1/1     Running             0          2s    172.19.0.55   aks-nodepool1-27087470-vmss000000   <none>           <none>
busybox-65bb6db647-fkzvc   0/1     Terminating         0          96s   10.244.0.3    free-node                           <none>           <none>
busybox-65bb6db647-fkzvc   0/1     Error               0          96s   10.244.0.3    free-node                           <none>           <none>
busybox-65bb6db647-fkzvc   0/1     Error               0          97s   10.244.0.3    free-node                           <none>           <none>
busybox-65bb6db647-fkzvc   0/1     Error               0          97s   10.244.0.3    free-node                           <none>           <none>
root@free-node:/home/qizhe#
```

Copilot AI review requested due to automatic review settings March 17, 2026 22:40

Copilot AI left a comment


Pull request overview

This PR adds Kubernetes node cordon/drain + uncordon behavior around the drift “kubernetes upgrade” remediation flow, and introduces reusable Kubernetes clientset helpers to avoid shelling out to kubectl for some node status checks.

Changes:

  • Add a new remediation step sequence for kubelet upgrades: cordon+drain → stop kubelet → download binaries → start kubelet → uncordon.
  • Introduce pkg/kube helpers for cached kubelet clientset and AKS-admin clientset creation.
  • Replace kubelet readiness probing via kubectl invocation with a client-go Node GET.
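The last bullet, probing readiness from a Node object instead of shelling out to kubectl, comes down to inspecting the node's Ready condition. A simplified stdlib-only sketch (the real code would use corev1.Node from k8s.io/api fetched via the clientset; `nodeCondition` and `isNodeReady` here are illustrative stand-ins):

```go
package main

import "fmt"

// nodeCondition mirrors the two corev1.NodeCondition fields the check needs;
// the real code inspects a corev1.Node fetched through client-go.
type nodeCondition struct {
	Type   string
	Status string
}

// isNodeReady reports whether the node's Ready condition is True, the same
// question the old kubectl jsonpath invocation answered.
func isNodeReady(conds []nodeCondition) bool {
	for _, c := range conds {
		if c.Type == "Ready" {
			return c.Status == "True"
		}
	}
	return false
}

func main() {
	fmt.Println(isNodeReady([]nodeCondition{{"MemoryPressure", "False"}, {"Ready", "True"}})) // true
	fmt.Println(isNodeReady([]nodeCondition{{"Ready", "False"}}))                             // false
	fmt.Println(isNodeReady(nil))                                                             // false
}
```

A missing Ready condition is treated as not ready, which is the conservative choice for a health probe.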

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 6 comments.

Summary per file:

  • pkg/status/collector.go: Switch kubelet readiness check from kubectl jsonpath to client-go Node condition inspection.
  • pkg/kube/client.go: Add client-go helpers: a cached kubelet clientset and an admin clientset built from the AKS admin kubeconfig.
  • pkg/drift/remediation.go: Insert new "cordon-and-drain" and "uncordon" steps into the Kubernetes upgrade remediation sequence.
  • pkg/drift/node_maintenance.go: Implement node maintenance operations (cordon/drain/uncordon) using k8s.io/kubectl/pkg/drain, with admin fallback.
  • pkg/drift/node_maintenance_test.go: Add unit tests for cordon/drain/uncordon orchestration and retry detection.
  • go.mod: Add the k8s.io/kubectl dependency and update/introduce several indirect deps.
  • go.sum: Update sums for newly introduced/updated dependencies.


Copilot AI review requested due to automatic review settings March 17, 2026 23:58

Copilot AI left a comment


Pull request overview

This PR adds node cordon/drain + uncordon around drift-driven Kubernetes (kubelet/binaries) upgrades, and improves status snapshot safety/accuracy so health checks don’t react to stale upgrade state.

Changes:

  • Add cordon-and-drain and uncordon steps to the Kubernetes upgrade remediation flow (with best-effort uncordon retry).
  • Introduce pkg/drift/node_maintenance.go (client-go + kubectl drain library) to cordon/drain/uncordon, preferring admin credentials when needed.
  • Improve status snapshot handling: in-process file lock for writers, “mark healthy after upgrade” snapshot update, and client-go based kubelet readiness check.

Reviewed changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 3 comments.

Summary per file:

  • pkg/drift/remediation.go: Adds upgrade step constants, inserts cordon/drain + uncordon steps, and updates status handling on success/failure.
  • pkg/drift/remediation_test.go: Adds tests for when upgrade failures should mark kubelet unhealthy.
  • pkg/drift/node_maintenance.go: Implements cordon/drain/uncordon via client-go + kubectl/pkg/drain, with admin fallback.
  • pkg/drift/node_maintenance_test.go: Unit tests for node maintenance executors and admin-retry detection.
  • pkg/kube/client.go: Adds cached kubelet clientset and AKS admin-kubeconfig clientset helpers.
  • pkg/status/collector.go: Switches node readiness check from kubectl invocation to client-go.
  • pkg/status/health.go: Adds "mark kubelet healthy after upgrade" and serializes status read-modify-write operations.
  • pkg/status/health_test.go: Adds coverage for "healthy after upgrade" status update behavior.
  • pkg/status/loader.go: Splits the loader into an unlocked helper for use under the new status lock.
  • pkg/status/lock.go: Introduces an in-process mutex for serializing status snapshot updates.
  • pkg/status/writer.go: Wraps status writes with the new status-file mutex.
  • go.mod / go.sum: Adds the k8s.io/kubectl dependency and updates related transitive deps.


Copilot AI review requested due to automatic review settings March 18, 2026 00:24
@zqingqing1 zqingqing1 deployed to e2e-testing March 18, 2026 00:24 — with GitHub Actions Active

Copilot AI left a comment


Pull request overview

Adds node cordon/drain support to the Kubernetes upgrade drift remediation flow, and improves status/health reporting and Kubernetes API interactions to better reflect node state during/after upgrades.

Changes:

  • Add cordon-and-drain and uncordon steps around kubelet binary upgrade remediation, including retry logic and unit tests.
  • Introduce in-process locking for status snapshot read/modify/write operations and add a “mark kubelet healthy after upgrade” status update (+ tests).
  • Replace kubectl get node ... readiness probing with a client-go call via a cached kubelet clientset; add shared kube client helpers (kubelet/admin).

Reviewed changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 1 comment.

Summary per file:

  • pkg/status/writer.go: Wrap status writes with an in-process mutex; split out an unlocked write helper.
  • pkg/status/lock.go: New status-file mutex helper to prevent lost updates within the agent process.
  • pkg/status/loader.go: Split out an unlocked load helper for use inside lock-protected sections.
  • pkg/status/health.go: Lock-protected status updates; add a "mark healthy after upgrade" helper.
  • pkg/status/health_test.go: Add a test ensuring "mark healthy after upgrade" preserves unrelated fields.
  • pkg/status/collector.go: Switch kubelet readiness probing to client-go Nodes().Get with timeout/constants.
  • pkg/kube/client.go: New cached kubelet clientset + admin clientset via the AKS management-plane kubeconfig.
  • pkg/drift/remediation.go: Add upgrade step constants, cordon/drain + uncordon steps, and a status update after a successful upgrade.
  • pkg/drift/remediation_test.go: Add tests for shouldMarkKubeletUnhealthyAfterUpgradeFailure behavior.
  • pkg/drift/node_maintenance.go: New Kubernetes node maintenance implementation (cordon/drain/uncordon) with admin retry and drain helper config.
  • pkg/drift/node_maintenance_test.go: Unit tests for cordon/drain/uncordon executor behavior and the admin-retry predicate.
  • go.mod: Add the k8s.io/kubectl dependency (and indirect deps).
  • go.sum: Update sums for new/updated dependencies pulled in by the kubectl/drain usage.


```
t.Fatalf("start-kubelet failure marked unhealthy=false, want true")
}

// Unknown step -> conservative true.
```

```
Force:               false,
GracePeriodSeconds:  -1,
IgnoreAllDaemonSets: true,
DeleteEmptyDirData:  false,
```
Collaborator


What does `DeleteEmptyDirData: false` mean here? Does it mean a pod using emptyDir volumes can't be drained?
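For context: kubectl's drain logic treats emptyDir volumes as local storage and refuses to evict such pods unless DeleteEmptyDirData (the --delete-emptydir-data flag) is set, because eviction deletes the pod and the emptyDir contents with it. So with `DeleteEmptyDirData: false`, a pod using emptyDir volumes makes the drain fail rather than being silently evicted. A simplified sketch of that check, using placeholder types instead of corev1.Pod:

```go
package main

import "fmt"

// Placeholder types standing in for corev1.Pod / corev1.Volume.
type volume struct{ EmptyDir bool }
type pod struct {
	Name    string
	Volumes []volume
}

// blocksDrain mirrors drain's local-storage filter: with
// deleteEmptyDirData=false, a pod using an emptyDir volume makes the drain
// fail (eviction would delete the emptyDir contents); with true, the pod
// is evicted anyway and its emptyDir data is lost.
func blocksDrain(p pod, deleteEmptyDirData bool) bool {
	if deleteEmptyDirData {
		return false
	}
	for _, v := range p.Volumes {
		if v.EmptyDir {
			return true
		}
	}
	return false
}

func main() {
	scratch := pod{Name: "busybox", Volumes: []volume{{EmptyDir: true}}}
	fmt.Println(blocksDrain(scratch, false)) // true: drain errors out on this pod
	fmt.Println(blocksDrain(scratch, true))  // false: pod evicted, data lost
}
```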

```
}

m.mu.Lock()
m.client = cs
```
Collaborator


Curious: would the admin kubeconfig ever expire?
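The snippet above suggests a build-once, cache-forever clientset, which is why expiry matters: if the admin kubeconfig carries expiring credentials, a long-lived cache keeps serving a stale client unless it is invalidated. A stdlib-only sketch of that caching pattern (`cachedClient` and its fields are placeholders, not the PR's actual types, which would cache a *kubernetes.Clientset):

```go
package main

import (
	"fmt"
	"sync"
)

// client stands in for *kubernetes.Clientset.
type client struct{ id int }

type cachedClient struct {
	mu     sync.Mutex
	c      *client
	builds int
	build  func() (*client, error)
}

// get returns the cached client, building it on first use. Note the cache
// is never invalidated: if the underlying kubeconfig credentials expired,
// callers would keep receiving the stale client until restart.
func (m *cachedClient) get() (*client, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.c != nil {
		return m.c, nil
	}
	cs, err := m.build()
	if err != nil {
		return nil, err
	}
	m.builds++
	m.c = cs
	return cs, nil
}

func main() {
	m := &cachedClient{build: func() (*client, error) { return &client{id: 1}, nil }}
	m.get()
	m.get()
	fmt.Println("builds:", m.builds) // builds: 1 (second call hit the cache)
}
```

If expiry is a real concern, the cache would need a TTL or an auth-error-triggered rebuild rather than this unconditional reuse.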

```
return cs, nil
}

func fetchClusterAdminKubeconfig(ctx context.Context, cfg *config.Config) ([]byte, error) {
```
Collaborator


This seems duplicated with pkg/bootstrapper/cluster_config_enricher.go; is there a way to share it?

```
}

kubeletClient = cs
kubeletErr = nil
```
Collaborator


nit: I don't think we need this variable.
