Summary
The `addTaintBySpec` and `removeTaintBySpec` functions in `node_controller.go` use `client.StrategicMergeFrom()` without optimistic locking, which can cause race conditions where concurrent taint modifications from other controllers are silently overwritten.
Background
When NRC adds or removes a taint, it:
- Creates a deep copy of the node at time T1
- Modifies the taints list
- Patches the node
If another controller (e.g., Karpenter, CSI drivers, cloud-controller-manager) modifies the node's taints between T1 and the patch, NRC's patch will overwrite those changes with stale state.
Current Code
https://github.com/kubernetes-sigs/node-readiness-controller/blob/main/internal/controller/node_controller.go#L257-L274
```go
func (r *RuleReadinessController) removeTaintBySpec(ctx context.Context, node *corev1.Node, taintSpec corev1.Taint, ruleName string) error {
	patch := client.StrategicMergeFrom(node.DeepCopy()) // No optimistic locking
	var newTaints []corev1.Taint
	for _, taint := range node.Spec.Taints {
		if taint.Key != taintSpec.Key || taint.Effect != taintSpec.Effect {
			newTaints = append(newTaints, taint)
		}
	}
	node.Spec.Taints = newTaints
	if err := r.Patch(ctx, node, patch); err != nil {
		return err
	}
	// ...
}
```
Race Condition Example
```text
T1: NRC caches node taints: [karpenter.sh/unregistered, xyz.co/taint]
T2: Karpenter removes karpenter.sh/unregistered → [xyz.co/taint]
T3: NRC removes xyz.co/taint using stale T1 state
    → patches with [karpenter.sh/unregistered]  // Karpenter's removal reverted!
```
This affects any taint modified by other controllers during the window between NRC's read and write.
Proposed Fix
Use `client.MergeFromWithOptions` with optimistic locking, as Karpenter does for both add and remove taint operations:
https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/controllers/state/statenode.go#L483-L533
```go
// Karpenter's RequireNoScheduleTaint (handles both add AND remove):
stored := node.DeepCopy()
// ... taint modifications (add or remove) ...
if !equality.Semantic.DeepEqual(stored, node) {
	// We use client.MergeFromWithOptimisticLock because patching
	// a list with a JSON merge patch can cause races due to the
	// fact that it fully replaces the list on a change
	return kubeClient.Patch(ctx, node,
		client.MergeFromWithOptions(stored, client.MergeFromWithOptimisticLock{}))
}
```
The fix for NRC would be:
```go
// Fails with conflict if node changed, allowing retry
patch := client.MergeFromWithOptions(node.DeepCopy(), client.MergeFromWithOptimisticLock{})
```
With optimistic locking, if the node's `resourceVersion` changed between read and write, the patch fails with a conflict error, allowing the controller to retry with fresh state.
Impact
Nodes can be left stuck with startup taints that should have been removed, because another controller's taint removal is silently reverted by NRC's stale patch.