
Fix: Prevent Nil Pointer Panic When Node Zone Label Changes During Processing #2500

Open

Yashika0724 wants to merge 1 commit into openyurtio:master from Yashika0724:fix-node-lifecycle-nil-panic

Conversation

@Yashika0724

Summary

This PR fixes a nil pointer panic in the Node Lifecycle Controller that can occur when a node’s
zone label changes between initial classification and retry-based reconciliation.

During retries, the controller may re-fetch the Node object from the API server with updated
topology labels, but the new zone may not be registered in the eviction queues. This results in
accessing a nil entry in zoneNoExecuteTainter, which causes the controller to panic and crash.

This fix adds defensive handling to safely register new zones when encountered and avoids invalid
map access.


Root Cause

Zones are registered only during initial node classification via addPodEvictorForNewZone.
However, in updateNodeFunc, the controller workflow can be:

  1. Node is copied from the initial node list
  2. Health update fails and retry is triggered
  3. Node is re-fetched from the API server
  4. Node’s zone label has changed between list and retry

In this case, the new zone is not present in zoneNoExecuteTainter.

When markNodeForTainting or markNodeAsReachable executes:

```go
nc.zoneNoExecuteTainter[nodetopology.GetZoneKey(node)]
```

the lookup for an unregistered zone returns nil, and the subsequent method call on that nil queue triggers:

```
panic: runtime error: invalid memory address or nil pointer dereference
```

which crashes the entire Node Lifecycle Controller.


Steps to Reproduce

1. Deploy OpenYurt with multiple zones (e.g. topology.kubernetes.io/zone=zone-a)
2. Ensure the Node Lifecycle Controller is running
3. Introduce API latency or network instability to trigger retry loops
4. While a node is being processed, update its zone label:

   ```
   kubectl label node <node-name> topology.kubernetes.io/zone=zone-new --overwrite
   ```

5. Controller panics when accessing the eviction queue for the new zone

Alternative reproduction:
• Use a mutating webhook that modifies node zone labels
• Create node churn to increase reconciliation retries


Fix Applied

markNodeForTainting
• Extract zone key once at function start
• Check if the zone is registered
• Dynamically register missing zones using the same logic as initial classification
• Safely access eviction queues after registration

markNodeAsReachable
• Check if zone exists before removing node from eviction queue
• If zone is not registered, return safely with no-op

This prevents nil map access while preserving existing taint and eviction behavior.


Impact

• Prevents controller-wide crashes
• Improves resilience when node labels change dynamically
• Safely handles retries during reconciliation
• No behavioral change to taint or eviction logic

Testing

• Code compiles locally
• Follows existing zone initialization patterns
• CI will validate controller integration behavior

Add defensive checks in markNodeForTainting and markNodeAsReachable
to handle cases where a node's zone label changes during processing.

Signed-off-by: Yashika0724 <ssyashika1311@gmail.com>
@Yashika0724 Yashika0724 requested a review from a team as a code owner January 20, 2026 20:36

@Yashika0724
Author

Hi @zyjhtangtang , this PR fixes a controller panic caused by zone label changes during retries by
safely handling missing zone registrations. I’d appreciate a review and any feedback. Thanks!

@codecov

codecov bot commented Jan 21, 2026

Codecov Report

❌ Patch coverage is 40.00000% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 44.06%. Comparing base (b0c90be) to head (3a6bba7).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...roller/nodelifecycle/node_life_cycle_controller.go | 40.00% | 7 Missing and 2 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2500      +/-   ##
==========================================
- Coverage   44.08%   44.06%   -0.03%     
==========================================
  Files         399      399              
  Lines       26560    26571      +11     
==========================================
- Hits        11710    11709       -1     
- Misses      13788    13797       +9     
- Partials     1062     1065       +3     
| Flag | Coverage Δ |
|---|---|
| unittests | 44.06% <40.00%> (-0.03%) ⬇️ |


@zyjhtangtang

@Yashika0724 Please focus on the blocking issues in the code detection.

@Yashika0724
Author

> @Yashika0724 Please focus on the blocking issues in the code detection.

Thanks for pointing that out.
I’ll check the failing CI and coverage issues and update the PR accordingly.
