sled-agent: Fix races when starting switch zone in a4x2 #9297
+51
−39
Yesterday @internet-diglett and I were looking at some weird a4x2 failures where sled-agent successfully started the switch zone but failed to configure uplinks within it. This PR fixes a race condition and a subsequent logic bug which together were causing that failure.
I'm not sure if it's possible to hit this race condition on real hardware. I tried going over a real sled-agent startup log to figure out if we just happened to start up the switch zone "fast enough", or if something in the real startup path was (implicitly) blocked on that setup being done. I think it's the latter but don't have great confidence in that; this is based on comparing log timestamps and looking at things that appear to be backed up waiting on mutexes held during the whole switch zone setup process. All of this is pretty gnarly; we have multiple issues discussing the need for some rework here anyway, but this is yet another spot fix to unblock active work.
Race condition: If we get our underlay info while we're still starting up the switch zone, we don't inform the task doing that startup about it, and therefore it doesn't attempt to configure uplinks.
We saw this sequence in sled-agent's logs:
This is consistent with starting the switch zone with no underlay information (i.e., we pass underlay_info=None here):
omicron/sled-agent/src/services.rs
Lines 3548 to 3557 in d743754
and then subsequently swapping out the request inside the SwitchZoneState::Initializing variant here:
omicron/sled-agent/src/services.rs
Lines 3558 to 3566 in d743754
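As a toy model of the two paths referenced above (not the actual sled-agent code), the sketch below contrasts a worker that receives underlay info once as a spawn-time argument with a worker that reads it from the shared request. Every name here (UnderlayInfo as a string wrapper, OldRequest, NewRequest, WorkerWithArg, WorkerWithRequest, the generation field, the placeholder value) is hypothetical, and it assumes the worker re-reads the current request whenever it acts.

```rust
use std::sync::{Arc, Mutex};

/// Placeholder for whatever the real underlay info carries (assumption).
#[derive(Clone, Debug, PartialEq)]
struct UnderlayInfo(String);

/// Hypothetical request shape with no room for underlay info.
struct OldRequest {
    generation: u32,
}

/// Hypothetical request shape where underlay info travels with the request.
struct NewRequest {
    generation: u32,
    underlay_info: Option<UnderlayInfo>,
}

/// Worker that is handed underlay info once, as a spawn-time argument.
struct WorkerWithArg {
    request: Arc<Mutex<OldRequest>>,
    underlay_info: Option<UnderlayInfo>,
}

impl WorkerWithArg {
    /// What the worker sees: the current request generation plus the
    /// underlay info it captured when it was created.
    fn view(&self) -> (u32, Option<UnderlayInfo>) {
        let generation = self.request.lock().unwrap().generation;
        (generation, self.underlay_info.clone())
    }
}

/// Worker that reads everything, including underlay info, from the request.
struct WorkerWithRequest {
    request: Arc<Mutex<NewRequest>>,
}

impl WorkerWithRequest {
    fn view(&self) -> (u32, Option<UnderlayInfo>) {
        let request = self.request.lock().unwrap();
        (request.generation, request.underlay_info.clone())
    }
}

/// Underlay info arrives mid-initialization, but the swapped-in request
/// cannot carry it, so the worker never learns about it.
fn race_before_fix() -> (u32, Option<UnderlayInfo>) {
    let slot = Arc::new(Mutex::new(OldRequest { generation: 1 }));
    let worker = WorkerWithArg { request: Arc::clone(&slot), underlay_info: None };
    let _late_info = Some(UnderlayInfo("placeholder-underlay".to_string()));
    *slot.lock().unwrap() = OldRequest { generation: 2 };
    worker.view()
}

/// Same sequence, but the swapped-in request carries the underlay info.
fn race_after_fix() -> (u32, Option<UnderlayInfo>) {
    let slot = Arc::new(Mutex::new(NewRequest { generation: 1, underlay_info: None }));
    let worker = WorkerWithRequest { request: Arc::clone(&slot) };
    let late_info = Some(UnderlayInfo("placeholder-underlay".to_string()));
    *slot.lock().unwrap() = NewRequest { generation: 2, underlay_info: late_info };
    worker.view()
}
```

In this model both workers observe the swapped request (generation 2), but only the second one also observes the underlay info that arrived with it.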
In the "swapping out the request" path, we actually have Some(underlay_info), but we're discarding it: it's not stored in request or new_request; we only passed it as an argument to start_switch_zone. The first commit, fcf094d, fixes this by moving the underlay_info into request instead of passing it as a function argument. Now when we swap out the request, the task running to perform initialization has access to the underlay_info and will attempt to configure uplinks.
Logic bug: Once we fixed the above, we saw the "ensure switch zone uplinks" worker stop after a single attempt as though it was told to:
When we first start initializing the switch zone, we spawn a task to do the work and store it in the ::Initializing variant's worker field:
omicron/sled-agent/src/services.rs
Lines 3524 to 3530 in d743754
Once that task gets an Ok(_) response from try_initialize_switch_zone(), it will attempt to configure uplinks until the exit channel is sent an explicit message or is dropped:
omicron/sled-agent/src/services.rs
Lines 4065 to 4073 in d743754
However, inside of try_initialize_switch_zone() itself, the last thing it does before returning is change the state from ::Initializing to ::Running, with no worker task:
omicron/sled-agent/src/services.rs
Lines 4006 to 4010 in d743754
This causes exit_tx to be dropped, which causes ensure_switch_zone_uplinks_configured_loop() to bail out after a single attempt, as we see in the logs above. This is fixed in 07aa525, which moves the worker task into the ::Running state instead of dropping it. (The ::Running state can have a non-None worker if we reconfigure the switch zone, so the supporting code already expects this to be present sometimes, and knows to stop the task when appropriate.)
This bug was mostly introduced by the above (not fully correct!) change to fix the race condition: prior to that change, the ::Initializing state never had the underlay_info in it anyway, so ensure_switch_zone_uplinks_configured_loop() wouldn't have even been called.
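The dropped-channel behavior can be reproduced in miniature with std::sync::mpsc. This is a sketch under assumptions, not the sled-agent implementation: Worker, ZoneState, configure_uplinks_loop, the two transition functions, and the max_attempts cap (added only so the example terminates) are all hypothetical names. It shows that a retry loop watching an exit channel stops after one attempt if the state transition drops the sender, and keeps retrying if the worker is moved into the new state.

```rust
use std::sync::mpsc::{channel, Receiver, Sender, TryRecvError};

/// Hypothetical stand-in for the worker handle stored in the state enum.
#[allow(dead_code)]
struct Worker {
    exit_tx: Sender<()>,
}

/// Toy version of the zone state; the real enum carries more data.
#[allow(dead_code)]
enum ZoneState {
    Initializing { worker: Option<Worker> },
    Running { worker: Option<Worker> },
}

/// Retry `attempt` until it succeeds, the exit channel receives a message
/// or is disconnected, or `max_attempts` is reached. Returns the number of
/// attempts made.
fn configure_uplinks_loop(
    exit_rx: &Receiver<()>,
    max_attempts: usize,
    mut attempt: impl FnMut() -> bool,
) -> usize {
    let mut attempts = 0;
    while attempts < max_attempts {
        attempts += 1;
        if attempt() {
            break;
        }
        match exit_rx.try_recv() {
            // Explicit stop message, or every sender has been dropped.
            Ok(()) | Err(TryRecvError::Disconnected) => break,
            Err(TryRecvError::Empty) => {}
        }
    }
    attempts
}

/// Transition that discards the worker, and with it the exit sender.
fn finish_init_dropping_worker(state: ZoneState) -> ZoneState {
    match state {
        ZoneState::Initializing { .. } => ZoneState::Running { worker: None },
        other => other,
    }
}

/// Transition that moves the worker into the new state.
fn finish_init_keeping_worker(state: ZoneState) -> ZoneState {
    match state {
        ZoneState::Initializing { worker } => ZoneState::Running { worker },
        other => other,
    }
}

/// Run the transition, then count how many attempts an always-failing
/// uplink configuration gets before the loop stops (capped at 5).
fn attempts_after(transition: fn(ZoneState) -> ZoneState) -> usize {
    let (exit_tx, exit_rx) = channel();
    let state = ZoneState::Initializing { worker: Some(Worker { exit_tx }) };
    let _state = transition(state);
    configure_uplinks_loop(&exit_rx, 5, || false)
}
```

With the dropping transition, attempts_after returns 1; with the keeping transition it runs until the cap of 5, since the sender stays alive inside the Running state.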