feat: implement node locking for NodeSet worker pods #130
giuliocalzo wants to merge 1 commit into SlinkyProject:main from
Conversation
Good morning @vivian-hafener, I rebased and adjusted based on the last pre-commit checks; feel free to review it.
Good afternoon @giuliocalzo, was this resolved by 0ce9ef7? Best regards,
Hi @vivian-hafener, the DaemonSet natively implements the 1:1 node/pod affinity, but the StatefulSet does not have this feature; this PR implements it.
@SkylerMalinowski I've just rebased and sorted out some conflicts.
SkylerMalinowski left a comment:
Let's assume LockNodes=true, LockNodeLifetime=0s, ScalingMode=StatefulSet, and pod-0 is assigned to node-0. If I were to node autoscale my Kubernetes cluster down such that node-0 is scaled-in, the node lock on node-0 would never expire and the scaling logic would create pod-0 with strict affinity for node-0 which no longer exists, hence will never run.
With respect to the NodeAssignments map, is it better to clear a node assignment when the Kube node is NotFound, or to have the scaling logic not create a pod with strict affinity for a node that is NotFound? I'm thinking the former is best, to avoid defunct NodeAssignments ever-increasing the map and potentially going over etcd entry limits, but the latter would best respect LockNodeLifetime=0s.
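The former option could be sketched as a pruning pass over the assignments map. This is a minimal, self-contained illustration; `PruneNodeAssignments`, the map shape, and `liveNodes` are hypothetical stand-ins, not the PR's actual types:

```go
package main

import "fmt"

// PruneNodeAssignments removes pod-to-node assignments whose target node no
// longer exists in the cluster, so defunct entries cannot accumulate in the
// status map. (Hypothetical sketch; names are not from the PR.)
func PruneNodeAssignments(assignments map[string]string, liveNodes map[string]bool) {
	for pod, node := range assignments {
		if !liveNodes[node] {
			delete(assignments, pod) // node is NotFound: drop the stale lock
		}
	}
}

func main() {
	assignments := map[string]string{"pod-0": "node-0", "pod-1": "node-1"}
	liveNodes := map[string]bool{"node-1": true} // node-0 was scaled in
	PruneNodeAssignments(assignments, liveNodes)
	fmt.Println(assignments) // pod-0's defunct assignment is gone
}
```

A pass like this would run during reconciliation, after listing cluster nodes, so a scaled-in node cannot leave a permanent lock behind.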
```go
// Only used when lockNodes is true.
// +optional
// +default:=0
LockNodeLifetime int32 `json:"lockNodeLifetime,omitempty"`
```
Use a Duration for rich value expressions.
```diff
-LockNodeLifetime int32 `json:"lockNodeLifetime,omitempty"`
+LockNodeLifetime metav1.Duration `json:"lockNodeLifetime,omitempty"`
```
```go
validOrdinals := make(map[string]struct{}, replicaCount)
for i := range replicaCount {
	validOrdinals[strconv.Itoa(i)] = struct{}{}
}
```
This is not correct. Unlike appsv1.StatefulSet, NodeSet does not strictly scale-in from highest ordinal to lowest in order. Gaps are allowed and frequent during scale-in.
Let's assume replicas=3 with an initial state of pod-{0,1,2}. Let's say you set replicas=2 and the NodeSet controller gracefully terminated pod-1. Your code would say that pod-2 is invalid despite being perfectly valid, and it would be erroneously cleared from the assignments map.
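To respect ordinal gaps, the validity set could be derived from the pods that actually exist rather than from a contiguous [0, replicas) range. A hedged sketch with a hypothetical helper name:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// validOrdinalsFromPods builds the set of ordinals from the pod names that
// actually exist, so a gapped state such as {pod-0, pod-2} after scale-in
// does not mark pod-2's assignment as invalid. (Hypothetical sketch.)
func validOrdinalsFromPods(podNames []string) map[string]struct{} {
	valid := make(map[string]struct{}, len(podNames))
	for _, name := range podNames {
		// The ordinal is the suffix after the last '-', e.g. "pod-2" -> "2".
		idx := strings.LastIndex(name, "-")
		if idx < 0 {
			continue
		}
		ord := name[idx+1:]
		if _, err := strconv.Atoi(ord); err == nil {
			valid[ord] = struct{}{}
		}
	}
	return valid
}

func main() {
	// replicas=2 after scale-in, but the controller removed pod-1, not pod-2.
	valid := validOrdinalsFromPods([]string{"pod-0", "pod-2"})
	_, ok := valid["2"]
	fmt.Println(ok) // true: pod-2 stays valid despite the gap
}
```

Keying validity on live pods rather than a replica-count range tolerates any termination order the controller chooses.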
This functionality was merged in 34f08a2, but the implementation significantly deviated from your PR.
Summary
Add lockNodes and lockNodeLifetime fields to NodeSetSpec to pin worker pods to their assigned Kubernetes nodes. When enabled, the controller records each pod-to-node mapping in NodeSetStatus and injects a requiredDuringSchedulingIgnoredDuringExecution NodeAffinity on pod recreation so each worker always returns to the same physical node.
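For illustration, the injected affinity on a recreated worker pod would look roughly like this (a hedged sketch assuming the lock is keyed on the `kubernetes.io/hostname` label; the exact shape in the controller may differ):

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - node-0   # the node recorded in NodeSetStatus for this pod
```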
The lockNodeLifetime field controls how long the lock persists: 0 means permanent, and a positive value (in seconds) causes the lock to expire after the pod stops running, allowing it to reschedule freely. Running pods continuously refresh their assignment timestamp so the countdown only begins once the pod is no longer active on the node.
Breaking Changes
None.
Testing Notes
Local testing with kind.
Additional Context