
add cordon and drain for upgrading #129

Open
zqingqing1 wants to merge 5 commits into main from qizhe/add-cordon-drain

Conversation

@zqingqing1 (Member)

Remediation log:

```
level=warning msg="Drift detected: id=kubernetes-version title=Kubernetes version drift details=kubelet=\"1.32.7\" desired=\"1.33.2\"" func="[remediation.go:92]"
level=info msg="Starting AKS node drift-kubernetes-upgrade" func="[executor.go:68]"
level=info msg="Executing drift-kubernetes-upgrade step cordon-and-drain" func="[executor.go:123]"
level=info msg="Cordoning node free-node before kubelet upgrade" func="[node_maintenance.go:270]"
level=info msg="Draining node free-node before kubelet upgrade" func="[node_maintenance.go:279]"
level=info msg="drift-kubernetes-upgrade step: cordon-and-drain completed successfully with duration 32.873261174s" func="[executor.go:147]"
level=info msg="Executing drift-kubernetes-upgrade step stop-kubelet" func="[executor.go:123]"
level=info msg="drift-kubernetes-upgrade step: stop-kubelet completed successfully with duration 43.442382ms" func="[executor.go:147]"
level=info msg="Executing drift-kubernetes-upgrade step download-kube-binaries" func="[executor.go:123]"
level=info msg="drift-kubernetes-upgrade step: download-kube-binaries completed successfully with duration 5.755130017s" func="[executor.go:147]"
level=info msg="Executing drift-kubernetes-upgrade step start-kubelet" func="[executor.go:123]"
level=info msg="drift-kubernetes-upgrade step: start-kubelet completed successfully with duration 76.14647ms" func="[executor.go:147]"
level=info msg="Executing drift-kubernetes-upgrade step uncordon" func="[executor.go:123]"
level=info msg="Uncordoning node free-node after kubelet upgrade" func="[node_maintenance.go:334]"
level=info msg="drift-kubernetes-upgrade step: uncordon completed successfully with duration 90.40138ms" func="[executor.go:147]"
level=info msg="AKS node drift-kubernetes-upgrade completed successfully (duration: 38.839009005s, stepCount: 5)" func="[executor.go:106]"
level=info msg="drift-kubernetes-upgrade completed successfully (duration: 38.839009005s, steps: 5)" func="[remediation.go:243]"
level=info msg="Kubernetes upgrade remediation completed successfully" func="[remediation.go:131]"

level=info msg="Initial drift detection after spec collection completed successfully" func="[commands.go:195]"
```

node tracking:

```
root@free-node:/home/qizhe# kubectl get node -o wide -w
NAME                                STATUS   ROLES    AGE     VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
aks-nodepool1-27087470-vmss000000   Ready    <none>   10h     v1.33.2   172.19.0.47    <none>        Ubuntu 22.04.5 LTS   5.15.0-1102-azure   containerd://1.7.30-2
aks-nodepool1-27087470-vmss000001   Ready    <none>   10h     v1.33.2   172.19.0.18    <none>        Ubuntu 22.04.5 LTS   5.15.0-1102-azure   containerd://1.7.30-2
aks-nodepool1-27087470-vmss000002   Ready    <none>   10h     v1.33.2   172.19.0.10    <none>        Ubuntu 22.04.5 LTS   5.15.0-1102-azure   containerd://1.7.30-2
flex-node                           Ready    <none>   53d     v1.32.7   10.5.202.191   <none>        Ubuntu 24.04.3 LTS   6.8.0-106-generic   containerd://1.7.20
free-node                           Ready    <none>   6d21h   v1.33.2   172.19.0.4     <none>        Ubuntu 24.04.3 LTS   6.14.0-1012-azure   containerd://2.0.4
free-node                           Ready,SchedulingDisabled   <none>   6d22h   v1.33.2   172.19.0.4     <none>        Ubuntu 24.04.3 LTS   6.14.0-1012-azure   containerd://2.0.4
free-node                           Ready,SchedulingDisabled   <none>   6d22h   v1.33.2   172.19.0.4     <none>        Ubuntu 24.04.3 LTS   6.14.0-1012-azure   containerd://2.0.4
aks-nodepool1-27087470-vmss000000   Ready                      <none>   10h     v1.33.2   172.19.0.47    <none>        Ubuntu 22.04.5 LTS   5.15.0-1102-azure   containerd://1.7.30-2
free-node                           Ready                      <none>   6d22h   v1.33.2   172.19.0.4     <none>        Ubuntu 24.04.3 LTS   6.14.0-1012-azure   containerd://2.0.4
free-node                           Ready                      <none>   6d22h   v1.33.2   172.19.0.4     <none>        Ubuntu 24.04.3 LTS   6.14.0-1012-azure   containerd://2.0.4
free-node                           Ready                      <none>   6d22h   v1.33.2   172.19.0.4     <none>        Ubuntu 24.04.3 LTS   6.14.0-1012-azure   containerd://2.0.4
aks-nodepool1-27087470-vmss000001   Ready                      <none>   10h     v1.33.2   172.19.0.18    <none>        Ubuntu 22.04.5 LTS   5.15.0-1102-azure   containerd://1.7.30-2
free-node                           Ready                      <none>   6d22h   v1.33.2   172.19.0.4     <none>        Ubuntu 24.04.3 LTS   6.14.0-1012-azure   containerd://2.0.4
^Croot@free-node:/home/qizhe#
```

pod tracking:

```
root@free-node:/home/qizhe# kubectl get pod -o wide -w
NAME                       READY   STATUS    RESTARTS   AGE   IP           NODE        NOMINATED NODE   READINESS GATES
busybox-65bb6db647-fkzvc   1/1     Running   0          38s   10.244.0.3   free-node   <none>           <none>
busybox-65bb6db647-fkzvc   1/1     Running   0          66s   10.244.0.3   free-node   <none>           <none>
busybox-65bb6db647-fkzvc   1/1     Terminating   0          66s   10.244.0.3   free-node   <none>           <none>
busybox-65bb6db647-fkzvc   1/1     Terminating   0          66s   10.244.0.3   free-node   <none>           <none>
busybox-65bb6db647-jd6nk   0/1     Pending       0          0s    <none>       <none>      <none>           <none>
busybox-65bb6db647-jd6nk   0/1     Pending       0          0s    <none>       aks-nodepool1-27087470-vmss000000   <none>           <none>
busybox-65bb6db647-jd6nk   0/1     ContainerCreating   0          0s    <none>       aks-nodepool1-27087470-vmss000000   <none>           <none>
busybox-65bb6db647-jd6nk   1/1     Running             0          2s    172.19.0.55   aks-nodepool1-27087470-vmss000000   <none>           <none>
busybox-65bb6db647-fkzvc   0/1     Terminating         0          96s   10.244.0.3    free-node                           <none>           <none>
busybox-65bb6db647-fkzvc   0/1     Error               0          96s   10.244.0.3    free-node                           <none>           <none>
busybox-65bb6db647-fkzvc   0/1     Error               0          97s   10.244.0.3    free-node                           <none>           <none>
busybox-65bb6db647-fkzvc   0/1     Error               0          97s   10.244.0.3    free-node                           <none>           <none>
root@free-node:/home/qizhe#
```

Copilot AI review requested due to automatic review settings March 17, 2026 22:40

Copilot AI left a comment


Pull request overview

This PR adds Kubernetes node cordon/drain + uncordon behavior around the drift “kubernetes upgrade” remediation flow, and introduces reusable Kubernetes clientset helpers to avoid shelling out to kubectl for some node status checks.

Changes:

  • Add a new remediation step sequence for kubelet upgrades: cordon+drain → stop kubelet → download binaries → start kubelet → uncordon.
  • Introduce pkg/kube helpers for cached kubelet clientset and AKS-admin clientset creation.
  • Replace kubelet readiness probing via kubectl invocation with a client-go Node GET.
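The last bullet, probing readiness from a Node object instead of shelling out to kubectl, comes down to inspecting the node's Ready condition. A simplified stdlib-only sketch (the real code would use corev1.Node from k8s.io/api fetched via the clientset; `nodeCondition` and `isNodeReady` here are illustrative stand-ins):

```go
package main

import "fmt"

// nodeCondition mirrors the two corev1.NodeCondition fields the check needs;
// the real code inspects a corev1.Node fetched through client-go.
type nodeCondition struct {
	Type   string
	Status string
}

// isNodeReady reports whether the node's Ready condition is True, the same
// question the old kubectl jsonpath invocation answered.
func isNodeReady(conds []nodeCondition) bool {
	for _, c := range conds {
		if c.Type == "Ready" {
			return c.Status == "True"
		}
	}
	return false
}

func main() {
	fmt.Println(isNodeReady([]nodeCondition{{"MemoryPressure", "False"}, {"Ready", "True"}})) // true
	fmt.Println(isNodeReady([]nodeCondition{{"Ready", "False"}}))                             // false
	fmt.Println(isNodeReady(nil))                                                             // false
}
```

A missing Ready condition is treated as not ready, which is the conservative choice for a health probe.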

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 6 comments.

Summary per file:

  • pkg/status/collector.go: Switch kubelet readiness check from kubectl jsonpath to client-go Node condition inspection.
  • pkg/kube/client.go: Add client-go helpers: a cached kubelet clientset and an admin clientset built from the AKS admin kubeconfig.
  • pkg/drift/remediation.go: Insert new "cordon-and-drain" and "uncordon" steps into the Kubernetes upgrade remediation sequence.
  • pkg/drift/node_maintenance.go: Implement node maintenance operations (cordon/drain/uncordon) using k8s.io/kubectl/pkg/drain, with admin fallback.
  • pkg/drift/node_maintenance_test.go: Add unit tests for cordon/drain/uncordon orchestration and retry detection.
  • go.mod: Add the k8s.io/kubectl dependency and update/introduce several indirect deps.
  • go.sum: Update sums for newly introduced/updated dependencies.


Copilot AI review requested due to automatic review settings March 17, 2026 23:58

Copilot AI left a comment


Pull request overview

This PR adds node cordon/drain + uncordon around drift-driven Kubernetes (kubelet/binaries) upgrades, and improves status snapshot safety/accuracy so health checks don’t react to stale upgrade state.

Changes:

  • Add cordon-and-drain and uncordon steps to the Kubernetes upgrade remediation flow (with best-effort uncordon retry).
  • Introduce pkg/drift/node_maintenance.go (client-go + kubectl drain library) to cordon/drain/uncordon, preferring admin credentials when needed.
  • Improve status snapshot handling: in-process file lock for writers, “mark healthy after upgrade” snapshot update, and client-go based kubelet readiness check.

Reviewed changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 3 comments.

Summary per file:

  • pkg/drift/remediation.go: Adds upgrade step constants, inserts cordon/drain + uncordon steps, and updates status handling on success/failure.
  • pkg/drift/remediation_test.go: Adds tests for when upgrade failures should mark kubelet unhealthy.
  • pkg/drift/node_maintenance.go: Implements cordon/drain/uncordon via client-go + kubectl/pkg/drain, with admin fallback.
  • pkg/drift/node_maintenance_test.go: Unit tests for node maintenance executors and admin-retry detection.
  • pkg/kube/client.go: Adds cached kubelet clientset and AKS admin-kubeconfig clientset helpers.
  • pkg/status/collector.go: Switches node readiness check from kubectl invocation to client-go.
  • pkg/status/health.go: Adds "mark kubelet healthy after upgrade" and serializes status read-modify-write operations.
  • pkg/status/health_test.go: Adds coverage for "healthy after upgrade" status update behavior.
  • pkg/status/loader.go: Splits the loader into an unlocked helper for use under the new status lock.
  • pkg/status/lock.go: Introduces an in-process mutex for serializing status snapshot updates.
  • pkg/status/writer.go: Wraps status writes with the new status-file mutex.
  • go.mod / go.sum: Adds the k8s.io/kubectl dependency and updates related transitive deps.


Copilot AI review requested due to automatic review settings March 18, 2026 00:24
@zqingqing1 zqingqing1 deployed to e2e-testing March 18, 2026 00:24 — with GitHub Actions Active

Copilot AI left a comment


Pull request overview

Adds node cordon/drain support to the Kubernetes upgrade drift remediation flow, and improves status/health reporting and Kubernetes API interactions to better reflect node state during/after upgrades.

Changes:

  • Add cordon-and-drain and uncordon steps around kubelet binary upgrade remediation, including retry logic and unit tests.
  • Introduce in-process locking for status snapshot read/modify/write operations and add a “mark kubelet healthy after upgrade” status update (+ tests).
  • Replace kubectl get node ... readiness probing with a client-go call via a cached kubelet clientset; add shared kube client helpers (kubelet/admin).

Reviewed changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 1 comment.

Summary per file:

  • pkg/status/writer.go: Wrap status writes with an in-process mutex; split out an unlocked write helper.
  • pkg/status/lock.go: New status-file mutex helper to prevent lost updates within the agent process.
  • pkg/status/loader.go: Split out an unlocked load helper for use inside lock-protected sections.
  • pkg/status/health.go: Lock-protected status updates; add a "mark healthy after upgrade" helper.
  • pkg/status/health_test.go: Add a test ensuring "mark healthy after upgrade" preserves unrelated fields.
  • pkg/status/collector.go: Switch kubelet readiness probing to client-go Nodes().Get with timeout/constants.
  • pkg/kube/client.go: New cached kubelet clientset + admin clientset via the AKS management-plane kubeconfig.
  • pkg/drift/remediation.go: Add upgrade step constants, cordon/drain + uncordon steps, and a status update after a successful upgrade.
  • pkg/drift/remediation_test.go: Add tests for shouldMarkKubeletUnhealthyAfterUpgradeFailure behavior.
  • pkg/drift/node_maintenance.go: New Kubernetes node maintenance implementation (cordon/drain/uncordon) with admin retry and drain helper config.
  • pkg/drift/node_maintenance_test.go: Unit tests for cordon/drain/uncordon executor behavior and the admin-retry predicate.
  • go.mod: Add the k8s.io/kubectl dependency (and indirect deps).
  • go.sum: Update sums for new/updated dependencies pulled in by the kubectl/drain usage.


```
t.Fatalf("start-kubelet failure marked unhealthy=false, want true")
}

// Unknown step -> conservative true.
```

```
Force:               false,
GracePeriodSeconds:  -1,
IgnoreAllDaemonSets: true,
DeleteEmptyDirData:  false,
```
Collaborator


What does `DeleteEmptyDirData: false` mean here? Does it mean a pod using emptyDir volumes can't be drained?
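For context: kubectl's drain logic treats emptyDir volumes as local storage and refuses to evict such pods unless DeleteEmptyDirData (the --delete-emptydir-data flag) is set, because eviction deletes the pod and the emptyDir contents with it. So with `DeleteEmptyDirData: false`, a pod using emptyDir volumes makes the drain fail rather than being silently evicted. A simplified sketch of that check, using placeholder types instead of corev1.Pod:

```go
package main

import "fmt"

// Placeholder types standing in for corev1.Pod / corev1.Volume.
type volume struct{ EmptyDir bool }
type pod struct {
	Name    string
	Volumes []volume
}

// blocksDrain mirrors drain's local-storage filter: with
// deleteEmptyDirData=false, a pod using an emptyDir volume makes the drain
// fail (eviction would delete the emptyDir contents); with true, the pod
// is evicted anyway and its emptyDir data is lost.
func blocksDrain(p pod, deleteEmptyDirData bool) bool {
	if deleteEmptyDirData {
		return false
	}
	for _, v := range p.Volumes {
		if v.EmptyDir {
			return true
		}
	}
	return false
}

func main() {
	scratch := pod{Name: "busybox", Volumes: []volume{{EmptyDir: true}}}
	fmt.Println(blocksDrain(scratch, false)) // true: drain errors out on this pod
	fmt.Println(blocksDrain(scratch, true))  // false: pod evicted, data lost
}
```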

```
}

m.mu.Lock()
m.client = cs
```
Collaborator


Curious: would the admin kubeconfig ever expire?
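The snippet above suggests a build-once, cache-forever clientset, which is why expiry matters: if the admin kubeconfig carries expiring credentials, a long-lived cache keeps serving a stale client unless it is invalidated. A stdlib-only sketch of that caching pattern (`cachedClient` and its fields are placeholders, not the PR's actual types, which would cache a *kubernetes.Clientset):

```go
package main

import (
	"fmt"
	"sync"
)

// client stands in for *kubernetes.Clientset.
type client struct{ id int }

type cachedClient struct {
	mu     sync.Mutex
	c      *client
	builds int
	build  func() (*client, error)
}

// get returns the cached client, building it on first use. Note the cache
// is never invalidated: if the underlying kubeconfig credentials expired,
// callers would keep receiving the stale client until restart.
func (m *cachedClient) get() (*client, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.c != nil {
		return m.c, nil
	}
	cs, err := m.build()
	if err != nil {
		return nil, err
	}
	m.builds++
	m.c = cs
	return cs, nil
}

func main() {
	m := &cachedClient{build: func() (*client, error) { return &client{id: 1}, nil }}
	m.get()
	m.get()
	fmt.Println("builds:", m.builds) // builds: 1 (second call hit the cache)
}
```

If expiry is a real concern, the cache would need a TTL or an auth-error-triggered rebuild rather than this unconditional reuse.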

```
return cs, nil
}

func fetchClusterAdminKubeconfig(ctx context.Context, cfg *config.Config) ([]byte, error) {
```
Collaborator


This seems duplicated with pkg/bootstrapper/cluster_config_enricher.go; is there a way to share it?

```
}

kubeletClient = cs
kubeletErr = nil
```
Collaborator


nit: I don't think we need this variable.
