Fix broken WSLCorePort channel after receive timeout #14455

Open
chemwolf6922 wants to merge 13 commits into microsoft:master from chemwolf6922:fix-broken-state-after-transaction-timeout

Contributor

@chemwolf6922 chemwolf6922 commented Mar 17, 2026

Summary of the Pull Request

This pattern shows up in multiple reports where WSL gets stuck after a sleep -> wake cycle:
[image]

In the current sequence number logic, the receive sequence number is incremented even when no message is received. If a timeout is allowed on the channel and the channel is not destroyed afterwards, the next receive will always get message N-1, for one of two reasons:

  1. The stale message arrived after the timeout, or
  2. The reply end never received the request, so its send counter is still at N-1.

This locks the channel in an unusable state.
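
A minimal sketch of case 2, to make the off-by-one concrete. The types `LegacyReceiver` and `LegacySender` and the helper `AcceptedAfterTimeout` are hypothetical stand-ins written for this comment, not the actual SocketChannel code, and the sketch assumes the receive counter advances on every receive attempt.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical model of independent counters; not the real SocketChannel.
struct LegacyReceiver
{
    uint32_t expected = 0; // sequence number the next message must carry

    // `incoming` is empty when the receive timed out.
    // Returns true only if a message arrived with the expected sequence.
    bool Receive(std::optional<uint32_t> incoming)
    {
        const uint32_t want = expected++; // advances even when nothing arrived
        return incoming.has_value() && *incoming == want;
    }
};

struct LegacySender
{
    uint32_t next = 0;

    // Returns the sequence number stamped on the outgoing message.
    uint32_t Send() { return next++; }
};

// The peer never saw the timed-out request, so its counter stays one behind.
// Returns how many of `attempts` later messages the receiver accepts.
inline int AcceptedAfterTimeout(int attempts)
{
    LegacyReceiver receiver;
    LegacySender peer;

    receiver.Receive(std::nullopt); // timeout: receiver expects 1, peer will send 0

    int accepted = 0;
    for (int i = 0; i < attempts; ++i)
    {
        if (receiver.Receive(peer.Send()))
        {
            ++accepted;
        }
    }
    return accepted;
}
```

In this model `AcceptedAfterTimeout(5)` returns 0: after a single timeout every later message is one behind what the receiver expects, and nothing brings the two counters back together.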

This PR makes the following changes to keep the WSLCorePort channels working after a receive timeout. It adds options that enable an alternative sequence number sync and check logic, in which:

  1. The reply side always syncs up to the request side's latest sequence number.
  2. The request side skips any stale message.

These options are only applied to the WSLCorePort channels to reduce risk, though other channels may face the same problem.
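
The alternative logic can be sketched roughly as follows. `Message`, `Replier`, `Requester`, and `RecoversAfterTimeout` are hypothetical names for illustration only; the sketch models the socket as an in-memory queue, ignores sequence wraparound, and is not the code in this PR.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <optional>

struct Message
{
    uint32_t sequence;
};

// Reply side: adopt the sequence number of whichever request arrives.
struct Replier
{
    uint32_t lastSeen = 0;

    Message ReplyTo(const Message& request)
    {
        lastSeen = request.sequence; // sync up instead of counting independently
        return Message{lastSeen};
    }
};

// Request side: remember the pending sequence and drop older replies.
struct Requester
{
    uint32_t nextSequence = 0;
    uint32_t pending = 0;

    Message NewRequest()
    {
        pending = ++nextSequence;
        return Message{pending};
    }

    // Drains `inbox` until the reply for the pending request is found.
    // An empty result models a receive timeout.
    std::optional<Message> ReceiveReply(std::deque<Message>& inbox)
    {
        while (!inbox.empty())
        {
            const Message reply = inbox.front();
            inbox.pop_front();
            if (reply.sequence == pending)
            {
                return reply;
            }
            if (reply.sequence > pending)
            {
                return std::nullopt; // unexpected; a real channel would report an error
            }
            // reply.sequence < pending: stale reply from a timed-out request, skip it
        }
        return std::nullopt;
    }
};

// Request 1 times out and its reply arrives late; request 2 must still succeed.
inline bool RecoversAfterTimeout()
{
    Requester requester;
    Replier replier;
    std::deque<Message> inbox;

    const Message first = requester.NewRequest();
    const Message lateReply = replier.ReplyTo(first);
    if (requester.ReceiveReply(inbox).has_value())
    {
        return false; // nothing has arrived yet, so this must time out
    }
    inbox.push_back(lateReply); // the stale reply shows up after the timeout

    const Message second = requester.NewRequest();
    inbox.push_back(replier.ReplyTo(second));
    const std::optional<Message> reply = requester.ReceiveReply(inbox);
    return reply.has_value() && reply->sequence == second.sequence;
}
```

Both failure cases from the summary are covered in this model: a late reply is skipped because its sequence is lower than the pending one, and a replier that never saw the timed-out request simply adopts the sequence of the next request it does see.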

PR Checklist

Detailed Description of the Pull Request / Additional comments

Validation Steps Performed

Failed unit tests


UnitTests::UnitTests::Warnings [Failed]
Error: Verify: AreEqual(expectedWarnings, warnings) - Values (, wsl: Due to a current compatibility issue with Global Secure Access Client, DNS Tunneling is disabled.
) [File: D:\workspace\WSL\test\windows\UnitTests.cpp, Function: UnitTests::UnitTests::Warnings::<lambda_1>::operator (), Line: 2022]

UnitTests::UnitTests::KernelModules [Failed]
Error: Caught std::exception: D:\workspace\WSL\test\windows\Common.cpp(261)\wsltests.dll!00007FFD0D986FF8: (caller: 00007FFD0D9868B9) Exception(57) tid(8178) 8000FFFF Catastrophic failure
    Msg:[Command "C:\WINDOWS\system32\wsl.exe echo ok"returned unexpected exit code (0 != -1). Stdout: 'ok
'Stderr: 'wsl: The .wslconfig setting 'wsl2.kernel' is disabled by the computer policy.
'] [LxsstuLaunchCommandAndCaptureOutput(E_UNEXPECTED)]

UnitTests::UnitTests::VersionFlavorParsing [Failed]
WSL1 is disabled by the computer policy.
Please run 'wsl.exe --set-version tmpdistro 2' to upgrade to WSL2.
Error code: Wsl/Service/CreateInstance/WSL_E_WSL1_DISABLED
Error: Verify: AreEqual(LxsstuLaunchWsl(std::format(L"-d {} cat /etc/os-release || true", Distro).c_str()), 0L) - Values (4294967295, 0) [File: D:\workspace\WSL\test\windows\UnitTests.cpp, Function: UnitTests::UnitTests::VersionFlavorParsing::<lambda_1>::operator (), Line: 3924]

UnitTests::UnitTests::CaseSensitivity [Failed]
Verify: IsFalse(getCaseSensitivity(std::format(L"{}/l1/l2/l3-other", testDir)))
Error: Caught std::exception: D:\workspace\WSL\src\windows\common\filesystem.cpp(339)\wsltests.dll!00007FFD0D2E0169: (caller: 00007FFD0D07A634) Exception(11) tid(1a18) C0000101     [`anonymous-namespace'::EnsureCaseSensitiveDirectoryRecursive::<lambda_1>::operator ()(NtSetInformationFile(Directory, &IoStatus, &CaseInfo, sizeof(CaseInfo), FileCaseSensitiveInformation))]

UnitTests::UnitTests::CustomModulesVhd [Failed]
Error: Caught std::exception: D:\workspace\WSL\test\windows\Common.cpp(261)\wsltests.dll!00007FFD0D986FF8: (caller: 00007FFD0D987DAF) Exception(1) tid(3f90) 8000FFFF Catastrophic failure
    Msg:[Command "Powershell -NoProfile -Command "$acl = Get-Acl 'D:\workspace\WSL\test-modules.vhd' ; $acl.RemoveAccessRuleAll((New-Object System.Security.AccessControl.FileSystemAccessRule(\"Everyone\", \"Read\", \"None\", \"None\", \"Allow\"))); Set-Acl -Path 'D:\workspace\WSL\test-modules.vhd' -AclObject $acl""returned unexpected exit code (1 != 0). Stdout: ''Stderr: 'Get-Acl : The 'Get-Acl' command was found in the module 'Microsoft.PowerShell.Security', but the module could not be loaded. For more informa
tion, run 'Import-Module Microsoft.PowerShell.Security'.
At line:1 char:8
+ $acl = Get-Acl 'D:\workspace\WSL\test-modules.vhd' ; $acl.RemoveAcces ...
+        ~~~~~~~
    + CategoryInfo          : ObjectNotFound: (Get-Acl:String) [], CommandNotFoundException
    + FullyQualifiedErrorId : CouldNotAutoloadMatchingModule

You cannot call a method on a null-valued expression.
At line:1 char:54
+ ... ules.vhd' ; $acl.RemoveAccessRuleAll((New-Object System.Security.Acce ...
+                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (:) [], RuntimeException
    + FullyQualifiedErrorId : InvokeMethodOnNull

Set-Acl : The 'Set-Acl' command was found in the module 'Microsoft.PowerShell.Security', but the module could not be loaded. For more informa
tion, run 'Import-Module Microsoft.PowerShell.Security'.
At line:1 char:190
+ ... sRule("Everyone", "Read", "None", "None", "Allow"))); Set-Acl -Path ' ...
+                                                           ~~~~~~~
    + CategoryInfo          : ObjectNotFound: (Set-Acl:String) [], CommandNotFoundException
    + FullyQualifiedErrorId : CouldNotAutoloadMatchingModule

'] [LxsstuLaunchCommandAndCaptureOutput(E_UNEXPECTED)]

UnitTests::UnitTests::BrokenDistroImport [Failed]
Error: Caught std::exception: D:\workspace\WSL\test\windows\Common.cpp(261)\wsltests.dll!00007FFD0D986FF8: (caller: 00007FFD0D987DAF) Exception(1) tid(13dc8) 8000FFFF Catastrophic failure
    Msg:[Command "Powershell -NoProfile -Command "New-Vhd EmptyVhd.vhdx  -SizeBytes 20MB""returned unexpected exit code (1 != 0). Stdout: ''Stderr: 'New-Vhd : Failed to create the virtual hard disk.
The system failed to create 'D:\workspace\WSL\EmptyVhd.vhdx'.
Failed to create the virtual hard disk.
The system failed to create 'D:\workspace\WSL\EmptyVhd.vhdx': The file exists. (0x80070050).
At line:1 char:1
+ New-Vhd EmptyVhd.vhdx  -SizeBytes 20MB
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (:) [New-VHD], VirtualizationException
    + FullyQualifiedErrorId : OperationFailed,Microsoft.Vhd.PowerShell.Cmdlets.NewVhd

'] [LxsstuLaunchCommandAndCaptureOutput(E_UNEXPECTED)]

UnitTests::UnitTests::WslDebug [Failed]
wsl: The .wslconfig setting 'wsl2.kernelCommandLine' is disabled by the computer policy.
Error: Verify: AreEqual(LxsstuLaunchWsl(L"dmesg | grep -iF 'vmbus_send_tl_connect_request'"), 0L) - Values (1, 0) [File: D:\workspace\WSL\test\windows\UnitTests.cpp, Function: UnitTests::UnitTests::WslDebug, Line: 6425]

UnitTests::UnitTests::CGroupv1 [Failed]
Error: Verify: AreEqual(out, expected) - Values (/sys/fs/cgroup/unified cgroup2 cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate
, ) [File: D:\workspace\WSL\test\windows\UnitTests.cpp, Function: UnitTests::UnitTests::CGroupv1::<lambda_1>::operator (), Line: 6435]

Contributor

Copilot AI left a comment

Pull request overview

This PR fixes a broken channel state that occurs after a transaction timeout in WSL's socket-based IPC protocol. The issue (#14193, #14055) manifests after laptop sleep/hibernate, where a channel's expected sequence number gets desynchronized, causing all subsequent communication to fail until wsl --shutdown.

Changes:

  • Replace independent sender/receiver sequence counters with an echo-back mechanism: the responder echoes back the request's sequence number in its reply, preventing desync after timeouts.
  • Add a magic number field to MESSAGE_HEADER for early framing corruption detection, and skip stale (timed-out) replies in the receive loop.
  • Zero-initialize a Reply union in binfmt.cpp to ensure the new MessageMagic default initializer doesn't cause issues with raw read() calls.
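
As an illustration of the early framing check mentioned above, a magic-number test on a freshly read header might look like the sketch below. `ExampleHeader`, `kExampleMagic`, and `ParseHeader` are hypothetical; the field layout and the magic value are assumptions and do not reflect the actual MESSAGE_HEADER definition.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>

// Arbitrary value chosen for this sketch; not the constant used in the PR.
constexpr uint32_t kExampleMagic = 0x4D53474Cu;

// Hypothetical header layout, assumed for illustration.
struct ExampleHeader
{
    uint32_t magic;
    uint32_t type;
    uint32_t size; // total message size, header included
    uint32_t sequence;
};

// Returns the parsed header, or an empty optional if the bytes do not start
// with a plausible header (short read, wrong magic, or impossible size).
inline std::optional<ExampleHeader> ParseHeader(const unsigned char* data, std::size_t length)
{
    if (length < sizeof(ExampleHeader))
    {
        return std::nullopt;
    }

    ExampleHeader header;
    std::memcpy(&header, data, sizeof(header)); // avoids unaligned access

    if (header.magic != kExampleMagic)
    {
        return std::nullopt; // stream is not positioned at a message boundary
    }
    if (header.size < sizeof(ExampleHeader))
    {
        return std::nullopt;
    }
    return header;
}
```

Checking the magic before trusting the size field means a receiver that has lost its place in the byte stream fails fast instead of trying to read a garbage-sized payload.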

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/shared/inc/lxinitshared.h | Added Magic constant and MessageMagic field to MESSAGE_HEADER; updated static_assert for new struct size. |
| src/shared/inc/SocketChannel.h | Rewrote send/receive sequence logic to echo-back model; added stale message skipping loop; replaced m_received_messages with m_expected_reply_sequence / m_pending_reply_sequence. |
| src/shared/inc/socketshared.h | Added magic number validation in RecvMessage before processing header. |
| src/linux/init/binfmt.cpp | Zero-initialized Reply union to handle new MessageMagic default member initializer. |

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 17, 2026 09:23
Contributor

Copilot AI left a comment

Pull request overview

This PR fixes a broken channel state issue in WSL's SocketChannel that occurs after a transaction timeout (e.g., when resuming from sleep). Previously, a timeout would increment the expected message ID on the receiver side, but the sender wouldn't use that incremented ID, causing a permanent ID desync and locking the channel. The fix replaces independent sequence tracking with an echo-back mechanism where the responder echoes back the request's sequence number in its reply, and the requester skips stale replies from previously timed-out transactions.

Changes:

  • Added a magic number field to MESSAGE_HEADER and validated it in RecvMessage to detect framing corruption early.
  • Replaced independent sequence counters with an echo-back sequence mechanism in SocketChannel using m_expected_reply_sequence and m_pending_reply_sequence, with a loop to skip stale replies.
  • Zero-initialized a union in binfmt.cpp to ensure the new MessageMagic field is properly initialized when reading responses.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/shared/inc/lxinitshared.h | Added Magic constant and MessageMagic field to MESSAGE_HEADER; updated static_assert for LX_GNS_SET_PORT_LISTENER size. |
| src/shared/inc/socketshared.h | Added magic number validation in RecvMessage after reading the header. |
| src/shared/inc/SocketChannel.h | Replaced send/receive sequence tracking with echo-back mechanism; added stale-reply skip loop; removed sequence parameter from ValidateMessageHeader. |
| src/linux/init/binfmt.cpp | Zero-initialized Reply union to ensure MessageMagic defaults correctly. |

Feng Wang added 2 commits March 17, 2026 17:31
…om:chemwolf6922/WSL into fix-broken-state-after-transaction-timeout
Copilot AI review requested due to automatic review settings March 20, 2026 03:36
Contributor

Copilot AI left a comment

Pull request overview

This PR aims to prevent SocketChannel protocol desynchronization after a transaction timeout (the “expected sequence” advancing while a late response with the previous sequence arrives), which can leave a channel in a permanently broken state.

Changes:

  • Reworks SocketChannel sequencing to echo request sequence numbers in replies and discard stale replies.
  • Adds new per-channel state (m_expected_reply_sequence / m_pending_reply_sequence) to track request/reply sequencing.
  • Updates protocol error handling to validate type separately from sequencing.

Copilot AI review requested due to automatic review settings March 20, 2026 05:55
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Copilot AI review requested due to automatic review settings March 20, 2026 07:50
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

Copilot AI review requested due to automatic review settings March 20, 2026 09:51
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

@chemwolf6922 chemwolf6922 changed the title Fix potential broken state after transaction timeout Fix potential broken state after receive timeout Mar 20, 2026
@chemwolf6922 chemwolf6922 changed the title Fix potential broken state after receive timeout Fix broken WSLCorePort channel after receive timeout Mar 20, 2026