Fix broken WSLCorePort channel after receive timeout#14455
chemwolf6922 wants to merge 13 commits into microsoft:master
Pull request overview
This PR fixes a broken channel state that occurs after a transaction timeout in WSL's socket-based IPC protocol. The issue (#14193, #14055) manifests after laptop sleep/hibernate, where a channel's expected sequence number gets desynchronized, causing all subsequent communication to fail until wsl --shutdown.
Changes:
- Replace independent sender/receiver sequence counters with an echo-back mechanism: the responder echoes back the request's sequence number in its reply, preventing desync after timeouts.
- Add a magic number field to `MESSAGE_HEADER` for early framing corruption detection, and skip stale (timed-out) replies in the receive loop.
- Zero-initialize a `Reply` union in `binfmt.cpp` to ensure the new `MessageMagic` default initializer doesn't cause issues with raw `read()` calls.
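The magic-number check described above can be sketched as follows. This is a minimal illustration, not the PR's code: `ExampleHeader`, its field order, `c_exampleMagic`, and `ParseHeader` are all hypothetical, and the real `MESSAGE_HEADER` in `lxinitshared.h` may be laid out differently.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <optional>
#include <vector>

// Hypothetical header layout, for illustration only.
struct ExampleHeader
{
    std::uint32_t MessageMagic;   // constant marker at the start of every message
    std::uint32_t MessageType;
    std::uint32_t MessageSize;    // total size in bytes, header included
    std::uint32_t SequenceNumber;
};

// Arbitrary illustrative value, not the constant used in the PR.
constexpr std::uint32_t c_exampleMagic = 0x57534C43;

// Copy a header into a byte buffer, as it would appear on the wire.
std::vector<std::uint8_t> Serialize(const ExampleHeader& header)
{
    std::vector<std::uint8_t> bytes(sizeof(header));
    std::memcpy(bytes.data(), &header, sizeof(header));
    return bytes;
}

// Parse a header from the start of a buffer. Returns std::nullopt when the
// buffer is too short, the magic does not match, or the size is implausible,
// so a misaligned read is rejected before the rest of the header is trusted.
std::optional<ExampleHeader> ParseHeader(const std::vector<std::uint8_t>& bytes)
{
    if (bytes.size() < sizeof(ExampleHeader))
    {
        return std::nullopt;
    }

    ExampleHeader header;
    std::memcpy(&header, bytes.data(), sizeof(header));

    if (header.MessageMagic != c_exampleMagic)
    {
        return std::nullopt;
    }

    if (header.MessageSize < sizeof(ExampleHeader))
    {
        return std::nullopt;
    }

    return header;
}
```

Checking the magic first means a reader that starts in the middle of a previous message fails immediately instead of interpreting arbitrary bytes as a type and size.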
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/shared/inc/lxinitshared.h | Added Magic constant and MessageMagic field to MESSAGE_HEADER; updated static_assert for new struct size. |
| src/shared/inc/SocketChannel.h | Rewrote send/receive sequence logic to echo-back model; added stale message skipping loop; replaced m_received_messages with m_expected_reply_sequence / m_pending_reply_sequence. |
| src/shared/inc/socketshared.h | Added magic number validation in RecvMessage before processing header. |
| src/linux/init/binfmt.cpp | Zero-initialized Reply union to handle new MessageMagic default member initializer. |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Pull request overview
This PR fixes a broken channel state issue in WSL's SocketChannel that occurs after a transaction timeout (e.g., when resuming from sleep). Previously, a timeout would increment the expected message ID on the receiver side, but the sender wouldn't use that incremented ID, causing a permanent ID desync and locking the channel. The fix replaces independent sequence tracking with an echo-back mechanism where the responder echoes back the request's sequence number in its reply, and the requester skips stale replies from previously timed-out transactions.
Changes:
- Added a magic number field to `MESSAGE_HEADER` and validated it in `RecvMessage` to detect framing corruption early.
- Replaced independent sequence counters with an echo-back sequence mechanism in `SocketChannel` using `m_expected_reply_sequence` and `m_pending_reply_sequence`, with a loop to skip stale replies.
- Zero-initialized a union in `binfmt.cpp` to ensure the new `MessageMagic` field is properly initialized when reading responses.
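A rough sketch of the echo-back idea with stale-reply skipping follows. It uses an in-memory queue in place of the socket, and every name in it (`ExampleRequester`, `MakeReply`, the member names) is hypothetical; it does not mirror the actual semantics of `m_expected_reply_sequence` or `m_pending_reply_sequence`. Treating a reply with a higher-than-expected sequence as a protocol error is also an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>

struct Message
{
    std::uint32_t sequence;
    std::string payload;
};

// The responder copies the request's sequence number into its reply.
Message MakeReply(const Message& request)
{
    return Message{request.sequence, "reply to " + request.payload};
}

class ExampleRequester
{
public:
    // Stamp an outgoing request with a fresh sequence number and remember it.
    Message MakeRequest(std::string payload)
    {
        m_pendingReplySequence = ++m_lastSentSequence;
        return Message{m_pendingReplySequence, std::move(payload)};
    }

    // Drain replies until the one matching the pending request is found.
    // Replies with a lower sequence belong to requests that already timed
    // out and are skipped. An empty queue models a receive timeout.
    std::optional<Message> ReceiveReply(std::deque<Message>& incoming)
    {
        while (!incoming.empty())
        {
            Message reply = std::move(incoming.front());
            incoming.pop_front();

            if (reply.sequence == m_pendingReplySequence)
            {
                return reply;
            }

            if (reply.sequence > m_pendingReplySequence)
            {
                throw std::runtime_error("reply sequence is ahead of any request sent");
            }
            // reply.sequence < pending: stale reply, keep reading.
        }

        return std::nullopt;
    }

private:
    std::uint32_t m_lastSentSequence = 0;
    std::uint32_t m_pendingReplySequence = 0;
};
```

Because the requester only compares against the sequence it most recently sent, a timeout leaves no state behind that could drift out of step with the peer.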
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/shared/inc/lxinitshared.h | Added Magic constant and MessageMagic field to MESSAGE_HEADER; updated static_assert for LX_GNS_SET_PORT_LISTENER size. |
| src/shared/inc/socketshared.h | Added magic number validation in RecvMessage after reading the header. |
| src/shared/inc/SocketChannel.h | Replaced send/receive sequence tracking with echo-back mechanism; added stale-reply skip loop; removed sequence parameter from ValidateMessageHeader. |
| src/linux/init/binfmt.cpp | Zero-initialized Reply union to ensure MessageMagic defaults correctly. |
…om:chemwolf6922/WSL into fix-broken-state-after-transaction-timeout
Pull request overview
This PR aims to prevent SocketChannel protocol desynchronization after a transaction timeout (the “expected sequence” advancing while a late response with the previous sequence arrives), which can leave a channel in a permanently broken state.
Changes:
- Reworks SocketChannel sequencing to echo request sequence numbers in replies and discard stale replies.
- Adds new per-channel state (`m_expected_reply_sequence` / `m_pending_reply_sequence`) to track request/reply sequencing.
- Updates protocol error handling to validate type separately from sequencing.
Summary of the Pull Request
This pattern shows up in multiple sleep -> wake -> wsl stuck reports:

In the current sequence number logic, the receive sequence is incremented even when no message is received. If a timeout is allowed on the channel and the channel is not destroyed afterwards, the next receive will always get the N-1 message due to:
This will lock the channel in an unusable state.
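The failure mode can be reproduced with a small simulation. This is a simplified model under stated assumptions, not the actual `SocketChannel` code: `OldStyleSender`, `OldStyleReceiver`, and `ReceiveResult` are hypothetical names, the wire is a queue of bare sequence numbers, and an empty queue stands in for a receive timeout.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

enum class ReceiveResult
{
    Ok,
    Timeout,
    SequenceMismatch
};

// The sender advances its counter only when it actually sends a message.
struct OldStyleSender
{
    std::uint32_t sent = 0;

    void Send(std::deque<std::uint32_t>& wire)
    {
        wire.push_back(++sent);
    }
};

// The receiver advances its expected sequence on every receive attempt,
// including attempts that time out without reading anything.
struct OldStyleReceiver
{
    std::uint32_t expected = 0;

    ReceiveResult Receive(std::deque<std::uint32_t>& wire)
    {
        ++expected;

        if (wire.empty())
        {
            return ReceiveResult::Timeout;
        }

        const std::uint32_t sequence = wire.front();
        wire.pop_front();
        return sequence == expected ? ReceiveResult::Ok : ReceiveResult::SequenceMismatch;
    }
};
```

In this model, one timed-out receive leaves `expected` one ahead of `sent`, so every later message arrives with sequence N-1 while the receiver expects N, and nothing ever brings the two counters back in step.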
This PR makes these changes to keep the WSLCorePort channels working after a receive timeout.
These changes are only applied to the WSLCorePort channels to reduce risk, though other channels may face the same problem.
PR Checklist
Closes: WSL2 crashes on waking up from sleep (#14193); WSL 2.6.3.0: Terminal crash after hibernation/sleep with [process exited with code 1] (#14055); Error code: Wsl/Service/E_UNEXPECTED (#14014)
Communication: I've discussed this with core contributors already. If work hasn't been agreed, this work might be rejected
Tests: Added/updated if needed and all pass
All but 8 unit tests pass. Six of the failures were caused by GPO settings or PowerShell issues. The other two (CGroupv1 and CaseSensitivity) failed, but the failures seem unrelated to these changes. I have appended logs for the failed tests at the end.
I'm also dogfooding this build right now.
Localization: All end user facing strings can be localized
Dev docs: Added/updated if needed
Documentation updated: If checked, please file a pull request on our docs repo and link it here: #xxx
Detailed Description of the Pull Request / Additional comments
Validation Steps Performed
Failed unit tests