
Conversation

@VSadov VSadov commented Dec 24, 2025

This is a follow-up on recommendations from the Scalability Experiments done some time ago.

The Scalability Experiments produced many suggestions. This part looks at the overheads of submitting and executing a work item on the thread pool from the thread scheduling point of view. To keep the change scoped, this PR tries to minimize changes to the work queue; the work queue related recommendations will be addressed separately.

The thread pool parts are very interconnected, though, and removing one bottleneck sometimes causes another one to show up, so a few work queue changes had to be made just to avoid regressions.

There are also a few "low hanging fruit" fixes for per-work-item overheads, such as unnecessary fences or overly frequent modifications of shared state.
Hopefully this will negate some of the regressions from #121887 (as was reported in #122186).

In this change:

  • Fewer operations per work item where possible.
    For example, fewer/weaker fences, reporting the heartbeat once per dispatch quantum instead of once per work item, etc.

  • Avoid spurious wakes of worker threads (except, unavoidably, when the thread goal is changed, e.g. by Hill Climbing).
    Only one thread is requested at a time, and requesting another thread is conditioned on evidence of work present in the queue (basically the minimum required for correctness).
    As a result, a thread that becomes active typically finds work.
    In particular, this avoids a cascade of spurious wakes when the pool is running out of work items. (A minimal sketch of this single-request scheme follows the list.)

  • Stop tracking spinners in the LIFO semaphore.
    We could keep track of spinners, but the informational value of knowing the spinner count is close to zero, so we should not pay for it.

  • No Sleep in the LIFO semaphore.
    Using spin-Sleep is questionable in a synchronization primitive that can block and ask the OS to wake a thread deterministically.

  • Shorten spinning in the LIFO semaphore to a more affordable value.
    Since the LIFO semaphore can perform a blocking wait until the condition it wants appears, once spinning gets into the range of the wait/wake latency it makes little sense to spin for much longer.
    It is also not uncommon for work to be introduced by non-pool threads, so the pool threads may need to start blocking in order to see more work.
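
The single-request scheme from the second bullet can be sketched roughly as follows. This is an illustrative C# sketch, not the actual runtime code: the type and member names here (WorkSignaler, RequestWorkerFromPool, MarkThreadRequestSatisfied) are made up for the example, and only the flag-based check-exchange pattern reflects the idea in the PR.

```csharp
using System.Threading;

internal sealed class WorkSignaler
{
    // 0 = no outstanding request; 1 = a worker has already been requested.
    private int _hasOutstandingThreadRequest;

    // Called by producers after enqueueing a work item.
    public void EnsureWorkerRequested()
    {
        // Only the thread that flips 0 -> 1 actually wakes a worker,
        // so at most one request is outstanding at a time.
        if (Volatile.Read(ref _hasOutstandingThreadRequest) == 0 &&
            Interlocked.CompareExchange(ref _hasOutstandingThreadRequest, 1, 0) == 0)
        {
            RequestWorkerFromPool();
        }
    }

    // Called by a worker once it has picked up the request and started dispatching.
    public void MarkThreadRequestSatisfied(bool moreWorkObserved)
    {
        Volatile.Write(ref _hasOutstandingThreadRequest, 0);

        // Request another worker only on evidence of more work in the queue;
        // that is the minimum needed for correctness and avoids cascades of
        // spurious wakes when the queue is draining.
        if (moreWorkObserved)
        {
            EnsureWorkerRequested();
        }
    }

    private static void RequestWorkerFromPool()
    {
        // In the real pool this would release a permit on the LIFO semaphore
        // (or queue a native callback); stubbed out for the sketch.
    }
}
```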

@dotnet-policy-service

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

VSadov commented Dec 25, 2025

As a benchmark to measure per-task overhead, I use a subset of the following benchmark: https://github.com/benaadams/ThreadPoolTaskTesting

Results were measured on Windows 11 with an AMD 7950X 16-core CPU.

The following is set to reduce possible noise:
DOTNET_TieredCompilation=0
DOTNET_GCDynamicAdaptationMode=0

Measurements are in tasks per second. Higher is better.

===== Baseline:

Testing 2,621,440 calls, with GCs after 2,621,440 calls.
Operations per second on 32 Cores
                                                                                                                             Parallelism
                                  Serial          2x          4x          8x         16x         32x         64x        128x        512x
QUWI No Queues (TP)              3.016 M     3.122 M     3.027 M     3.048 M     3.037 M     3.014 M     2.992 M     2.997 M     3.003 M
- Depth    2                     3.005 M     2.958 M     3.026 M     3.002 M     2.935 M     2.970 M     2.990 M     2.977 M     3.043 M
- Depth    4                     2.901 M     2.843 M     2.972 M     2.967 M     3.019 M     2.980 M     3.004 M     2.991 M     3.010 M
- Depth    8                     2.965 M     2.919 M     2.902 M     2.734 M     2.902 M     2.922 M     2.913 M     2.927 M     2.937 M
- Depth   16                     2.934 M     2.914 M     2.820 M     2.896 M     2.917 M     2.928 M     2.932 M     2.877 M     2.910 M
- Depth   32                     2.893 M     2.871 M     2.887 M     2.898 M     2.892 M     2.912 M     2.916 M     2.899 M     2.892 M
- Depth   64                     2.894 M     2.867 M     2.888 M     2.881 M     2.877 M     2.887 M     2.879 M     2.883 M     2.915 M
- Depth  128                     2.902 M     2.882 M     2.917 M     2.908 M     2.908 M     2.904 M     2.915 M     2.904 M     2.901 M
- Depth  512                     2.925 M     2.914 M     2.921 M     2.924 M     2.925 M     2.905 M     2.915 M     2.911 M     2.942 M


QUWI Queue Local (TP)            4.799 M     6.223 M    10.593 M    17.118 M    32.025 M    29.635 M    33.384 M    34.646 M    42.084 M
- Depth    2                     6.213 M    10.461 M    14.443 M    21.303 M    32.522 M    38.689 M    39.372 M    40.590 M    43.064 M
- Depth    4                     9.471 M    14.051 M    21.886 M    32.072 M    39.414 M    44.118 M    44.708 M    44.957 M    45.676 M
- Depth    8                    14.232 M    21.544 M    33.413 M    38.438 M    42.951 M    46.537 M    46.760 M    46.946 M    46.824 M
- Depth   16                    21.784 M    33.438 M    37.524 M    41.762 M    45.507 M    47.363 M    47.675 M    47.967 M    48.034 M
- Depth   32                    33.545 M    40.413 M    43.498 M    46.019 M    48.190 M    48.020 M    48.061 M    48.373 M    47.901 M
- Depth   64                    40.034 M    43.087 M    45.389 M    47.332 M    48.465 M    49.146 M    49.223 M    48.879 M    49.355 M
- Depth  128                    42.980 M    46.443 M    47.577 M    48.383 M    48.804 M    48.962 M    48.755 M    49.365 M    49.517 M
- Depth  512                    47.251 M    49.185 M    49.091 M    49.134 M    49.473 M    49.010 M    49.398 M    49.381 M    49.238 M

With this PR:

Testing 2,621,440 calls, with GCs after 2,621,440 calls.
Operations per second on 32 Cores
                                                                                                                             Parallelism
                                  Serial          2x          4x          8x         16x         32x         64x        128x        512x
QUWI No Queues (TP)              3.063 M     3.083 M     3.047 M     3.063 M     3.003 M     3.033 M     3.037 M     3.011 M     3.026 M
- Depth    2                     3.017 M     2.908 M     2.952 M     3.017 M     3.027 M     2.948 M     2.968 M     3.000 M     3.044 M
- Depth    4                     2.925 M     2.977 M     3.024 M     3.008 M     2.979 M     2.980 M     3.004 M     3.022 M     2.991 M
- Depth    8                     2.990 M     2.927 M     2.821 M     2.721 M     2.915 M     2.951 M     2.942 M     2.915 M     2.975 M
- Depth   16                     2.942 M     2.967 M     2.911 M     2.912 M     3.021 M     2.941 M     2.901 M     2.963 M     2.928 M
- Depth   32                     2.961 M     2.944 M     2.948 M     2.944 M     2.986 M     2.954 M     2.937 M     2.939 M     2.936 M
- Depth   64                     2.968 M     2.952 M     2.948 M     2.960 M     2.959 M     2.944 M     2.949 M     2.951 M     2.956 M
- Depth  128                     2.964 M     2.966 M     2.963 M     2.966 M     2.958 M     2.973 M     2.975 M     2.961 M     2.963 M
- Depth  512                     2.990 M     2.968 M     2.971 M     2.988 M     2.963 M     2.979 M     2.977 M     2.984 M     2.984 M


QUWI Queue Local (TP)            5.492 M    10.456 M    17.804 M    18.884 M    48.797 M   127.532 M   162.716 M   158.277 M   214.897 M
- Depth    2                    11.165 M    19.263 M    17.196 M    29.374 M    76.102 M   161.291 M   160.480 M   178.629 M   209.432 M
- Depth    4                    19.652 M    19.699 M    25.000 M    53.820 M   101.990 M   171.157 M   176.565 M   192.625 M   214.042 M
- Depth    8                    23.519 M    25.474 M    37.569 M    91.185 M   136.847 M   157.626 M   183.437 M   199.642 M   204.375 M
- Depth   16                    27.862 M    41.280 M    76.235 M   118.098 M   159.696 M   200.514 M   197.469 M   209.445 M   211.314 M
- Depth   32                    40.314 M    77.313 M   115.200 M   150.681 M   187.060 M   204.254 M   201.597 M   205.071 M   211.376 M
- Depth   64                    73.297 M   139.082 M   172.258 M   176.718 M   199.829 M   218.152 M   205.455 M   205.482 M   213.325 M
- Depth  128                   133.615 M   176.833 M   186.944 M   199.672 M   205.262 M   207.053 M   201.800 M   211.560 M   215.708 M
- Depth  512                   192.360 M   210.508 M   210.339 M   217.688 M   211.859 M   208.360 M   212.702 M   212.503 M   212.348 M

In QUWI Queue Local we are able to execute many more work items per second.

QUWI No Queues is bottlenecked on the FIFO work item queue. This PR does not address that part, so it sees no benefit from concurrency, depth, or anything else.

NOTE: this is a microbenchmark! The tasks are very trivial, on the level of "increment a counter".
The benchmark is a good tool for checking per-task overheads and bottlenecks.
The improvements will vary in real scenarios, where work items do more work than they do in the benchmark.

VSadov commented Dec 25, 2025

For reference, here are the same measurements as above, but with
set DOTNET_ThreadPool_UseWindowsThreadPool=1

In this case the task queue is the same, but thread management is done by the OS.
In particular, there is no Hill Climbing and similar machinery, so per-task expenses are somewhat lower to start with.
(There are downsides to using the OS thread pool, but per-task expenses are lower, at least currently.)

This variant benefits more from the per-work-item improvements in the PR.

===== Baseline:

Testing 2,621,440 calls, with GCs after 2,621,440 calls.
Operations per second on 32 Cores
                                                                                                                             Parallelism
                                  Serial          2x          4x          8x         16x         32x         64x        128x        512x
QUWI No Queues (TP)              3.075 M     3.068 M     3.061 M     3.077 M     3.030 M     3.082 M     3.058 M     3.056 M     3.045 M
- Depth    2                     3.016 M     3.010 M     3.037 M     2.961 M     3.011 M     3.033 M     2.937 M     2.958 M     3.001 M
- Depth    4                     2.945 M     2.997 M     2.983 M     2.987 M     2.990 M     2.975 M     2.990 M     2.965 M     2.942 M
- Depth    8                     2.963 M     2.909 M     2.953 M     2.977 M     2.912 M     2.991 M     2.947 M     3.001 M     2.977 M
- Depth   16                     2.983 M     2.932 M     2.875 M     2.962 M     2.957 M     2.975 M     2.974 M     2.965 M     2.949 M
- Depth   32                     2.963 M     2.961 M     2.962 M     2.945 M     2.939 M     2.963 M     2.952 M     2.955 M     2.958 M
- Depth   64                     2.951 M     2.955 M     2.957 M     2.962 M     2.947 M     2.959 M     2.944 M     2.957 M     2.960 M
- Depth  128                     2.956 M     2.972 M     2.972 M     2.972 M     2.969 M     2.961 M     2.967 M     2.966 M     2.967 M
- Depth  512                     2.964 M     2.968 M     2.966 M     2.965 M     2.964 M     2.963 M     2.974 M     2.968 M     2.971 M


QUWI Queue Local (TP)            7.631 M    15.943 M    22.686 M    30.837 M    35.689 M    45.171 M    50.381 M    52.533 M    55.251 M
- Depth    2                    12.866 M    21.285 M    32.843 M    27.394 M    39.129 M    51.520 M    52.513 M    53.210 M    54.110 M
- Depth    4                    22.034 M    31.817 M    29.279 M    37.944 M    44.660 M    53.304 M    54.043 M    54.486 M    55.104 M
- Depth    8                    36.834 M    36.869 M    40.448 M    44.281 M    50.549 M    55.034 M    55.973 M    55.923 M    55.973 M
- Depth   16                    45.172 M    40.375 M    44.735 M    48.158 M    52.005 M    56.377 M    56.223 M    56.254 M    56.111 M
- Depth   32                    38.088 M    45.396 M    47.299 M    50.493 M    54.404 M    56.791 M    56.596 M    56.820 M    56.028 M
- Depth   64                    44.804 M    48.991 M    51.455 M    53.755 M    55.848 M    56.931 M    57.380 M    56.553 M    56.975 M
- Depth  128                    48.775 M    51.866 M    53.985 M    55.647 M    56.168 M    57.248 M    57.098 M    57.415 M    56.359 M
- Depth  512                    55.442 M    55.050 M    56.782 M    56.497 M    57.615 M    57.055 M    57.175 M    56.792 M    56.774 M

With this PR:

Testing 2,621,440 calls, with GCs after 2,621,440 calls.
Operations per second on 32 Cores
                                                                                                                             Parallelism
                                  Serial          2x          4x          8x         16x         32x         64x        128x        512x
QUWI No Queues (TP)              3.063 M     3.110 M     3.060 M     3.041 M     3.050 M     3.056 M     3.057 M     3.069 M     3.038 M
- Depth    2                     3.047 M     3.026 M     3.013 M     3.029 M     2.968 M     2.969 M     3.021 M     3.003 M     3.006 M
- Depth    4                     2.947 M     3.036 M     3.076 M     3.012 M     3.041 M     3.028 M     3.011 M     3.037 M     3.053 M
- Depth    8                     2.974 M     3.015 M     3.012 M     3.025 M     3.020 M     3.026 M     3.009 M     3.004 M     3.059 M
- Depth   16                     3.014 M     3.031 M     2.959 M     3.043 M     3.017 M     3.012 M     3.015 M     3.021 M     3.034 M
- Depth   32                     3.015 M     2.995 M     3.037 M     3.014 M     3.022 M     2.991 M     3.030 M     2.994 M     3.041 M
- Depth   64                     2.998 M     3.024 M     3.023 M     3.004 M     3.025 M     3.019 M     3.015 M     2.992 M     3.000 M
- Depth  128                     3.008 M     3.003 M     3.005 M     3.006 M     3.010 M     3.006 M     3.000 M     3.004 M     3.004 M
- Depth  512                     3.013 M     3.010 M     3.015 M     3.013 M     3.015 M     3.015 M     3.010 M     3.018 M     3.030 M


QUWI Queue Local (TP)            7.077 M    13.085 M    27.234 M    33.498 M    54.361 M   163.725 M   223.814 M   230.576 M   238.356 M
- Depth    2                    13.710 M    24.174 M    37.391 M    43.846 M    84.419 M   239.729 M   234.524 M   247.397 M   248.473 M
- Depth    4                    22.145 M    34.966 M    52.797 M    62.646 M   126.954 M   193.050 M   222.039 M   255.051 M   247.278 M
- Depth    8                    35.400 M    56.054 M    65.657 M    93.659 M   166.261 M   239.907 M   252.275 M   235.815 M   259.223 M
- Depth   16                    55.728 M    70.509 M   101.264 M   141.210 M   196.271 M   258.734 M   261.945 M   257.276 M   259.935 M
- Depth   32                    70.777 M    81.802 M   133.855 M   175.980 M   233.571 M   248.360 M   256.966 M   261.785 M   263.864 M
- Depth   64                    89.869 M   144.095 M   203.279 M   207.784 M   243.898 M   261.001 M   256.780 M   264.602 M   261.545 M
- Depth  128                   146.725 M   198.235 M   212.273 M   229.452 M   254.920 M   259.824 M   261.772 M   261.219 M   262.685 M
- Depth  512                   239.916 M   253.174 M   258.385 M   258.077 M   262.625 M   265.057 M   264.875 M   264.383 M   262.972 M

@VSadov VSadov marked this pull request as ready for review December 25, 2025 01:26
Copilot AI review requested due to automatic review settings December 25, 2025 01:27

Copilot AI left a comment

Pull request overview

This PR implements performance improvements to the thread pool by reducing per-work-item overhead and minimizing spurious thread wakeups. The primary focus is on optimizing thread scheduling and synchronization while maintaining correctness.

Key changes include:

  • Introducing a single outstanding thread request flag (_hasOutstandingThreadRequest) to replace the counter-based approach, preventing thundering herd issues
  • Reducing memory barriers and fence operations in critical paths where volatile semantics provide sufficient guarantees
  • Simplifying semaphore spinning logic and removing spinner tracking overhead
  • Implementing exponential backoff for contended interlocked operations (a rough sketch of the idea appears after this list)
  • Deferring work queue assignment to reduce lock contention during dispatch startup
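
The backoff item above can be illustrated with a small sketch. This is not the Backoff.cs added by the PR (which, per the file summary below, derives its randomization from a stack address); it is a simplified, hypothetical version showing only the general technique of jittered exponential backoff around a contended CompareExchange.

```csharp
using System;
using System.Threading;

internal static class ContendedCounter
{
    // Increment a shared counter with CAS, backing off exponentially (with
    // jitter) whenever the CAS loses a race to another thread.
    public static int IncrementWithBackoff(ref int location)
    {
        // Cheap per-thread jitter seed; the real code reportedly uses a stack
        // address for this purpose instead.
        uint jitter = (uint)Environment.CurrentManagedThreadId * 2654435769u;
        int attempt = 0;

        while (true)
        {
            int observed = Volatile.Read(ref location);
            if (Interlocked.CompareExchange(ref location, observed + 1, observed) == observed)
            {
                return observed + 1;
            }

            // Lost the race: spin for a randomized, exponentially growing (but
            // capped) number of iterations so contending threads spread out.
            int maxSpins = 1 << Math.Min(attempt, 6);       // 1, 2, 4, ..., 64
            Thread.SpinWait(1 + (int)(jitter % (uint)maxSpins));

            jitter = jitter * 1103515245u + 12345u;         // advance the cheap PRNG
            attempt++;
        }
    }
}
```

In use, such a helper would simply stand in for a plain Interlocked.Increment on a counter that is known to be heavily contended.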

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.

Summary per file:

  WindowsThreadPool.cs: Adds a CacheLineSeparated struct with a thread request flag and implements EnsureWorkerRequested() with a check-exchange pattern to prevent duplicate requests
  ThreadPoolWorkQueue.cs: Removes unnecessary volatile writes in LocalPush, eliminates the _mayHaveHighPriorityWorkItems flag, defers queue assignment, and updates thread request calls
  ThreadPool.Windows.cs: Changes YieldFromDispatchLoop to accept a currentTickCount parameter and renames RequestWorkerThread to EnsureWorkerRequested
  ThreadPool.Unix.cs: Updates the YieldFromDispatchLoop signature to call NotifyDispatchProgress and renames RequestWorkerThread
  ThreadPool.Wasi.cs: Updates the YieldFromDispatchLoop signature with a pragma to suppress the unused parameter warning
  ThreadPool.Browser.cs: Updates the YieldFromDispatchLoop signature with a pragma to suppress the unused parameter warning
  ThreadPool.Browser.Threads.cs: Updates the YieldFromDispatchLoop signature with a pragma to suppress the unused parameter warning
  PortableThreadPool.cs: Renames lastDequeueTime to lastDispatchTime, replaces numRequestedWorkers with _hasOutstandingThreadRequest, and refactors the NotifyWorkItemProgress methods
  PortableThreadPool.WorkerThread.cs: Reduces the semaphore spin count from 70 to 9, refactors WorkerDoWork to use a check-exchange pattern, and implements TryRemoveWorkingWorker with overflow handling
  PortableThreadPool.ThreadCounts.cs: Adds an IsOverflow property and TryIncrement/DecrementProcessingWork methods to manage overflow state using the high bit of _data (see the sketch after this table)
  PortableThreadPool.GateThread.cs: Updates starvation detection to use _hasOutstandingThreadRequest and lastDispatchTime
  PortableThreadPool.Blocking.cs: Updates the blocking adjustment logic to check _hasOutstandingThreadRequest
  LowLevelLifoSemaphore.cs: Removes spinner tracking, changes Release(count) to Signal(), simplifies Wait() to remove the spinWait parameter, and restructures the Counts bit layout
  Backoff.cs: Introduces a new exponential backoff utility using stack-address-based randomization for collision retry scenarios
  System.Private.CoreLib.Shared.projitems: Adds Backoff.cs to the project compilation
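
As an illustration of the overflow-bit idea from the PortableThreadPool.ThreadCounts.cs entry, here is a simplified, hypothetical sketch of a packed count whose high bit latches an overflow state. The real ThreadCounts struct packs several counters and has a different layout; only the high-bit technique is the point here.

```csharp
using System.Threading;

// Simplified stand-in for a packed counts value whose high bit marks an
// "overflow" state; purely illustrative, not the runtime's ThreadCounts.
internal struct PackedCount
{
    private const int OverflowBit = unchecked((int)0x8000_0000);
    private int _data;

    public bool IsOverflow => (_data & OverflowBit) != 0;
    public int Count => _data & ~OverflowBit;

    // Try to increment the count; fails (and latches the overflow bit)
    // once the count reaches the maximum representable value.
    public static bool TryIncrement(ref PackedCount counts)
    {
        while (true)
        {
            int observed = Volatile.Read(ref counts._data);
            if ((observed & OverflowBit) != 0)
            {
                return false; // already overflowed
            }

            int newValue = observed == ~OverflowBit   // count is at max (0x7FFF_FFFF)
                ? observed | OverflowBit              // latch overflow instead of wrapping
                : observed + 1;

            if (Interlocked.CompareExchange(ref counts._data, newValue, observed) == observed)
            {
                return (newValue & OverflowBit) == 0;
            }
        }
    }
}
```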

VSadov and others added 2 commits December 24, 2025 22:29