
Conversation

@VSadov VSadov commented Dec 24, 2025

This is a follow-up on recommendations from the Scalability Experiments done some time ago.

The Scalability Experiments produced many suggestions. This part looks at the overheads of submitting and executing a work item on the thread pool from the thread scheduling point of view. To keep the change scoped, this PR tries to minimize changes to the work queue; the work queue related recommendations will be addressed separately.

The thread pool parts are very interconnected, though, and removing one bottleneck sometimes causes another one to show up, so a few work queue changes had to be made just to avoid regressions.

There are also a few "low hanging fruit" fixes for per-work-item overheads, such as unnecessary fences or overly frequent modifications of shared state.
Hopefully this will negate some of the regressions from #121887 (as was reported in #122186).

In this change:

  • Fewer operations per work item where possible.
    For example, fewer/weaker fences, reporting the heartbeat once per dispatch quantum instead of once per work item, etc.

  • Avoid spurious wakes of worker threads (except, unavoidably, when the thread goal is changed, e.g. by Hill Climbing).
    Only one thread is requested at a time, and requesting another thread is conditioned on evidence of work present in the queue (basically the minimum required for correctness).
    As a result, a thread that becomes active typically finds work.
    In particular, this avoids a cascade of spurious wakes when the pool is running out of work items. (A minimal sketch of this single-request scheme follows the list.)

  • Stop tracking spinners in the LIFO semaphore.
    We could keep track of spinners, but the informational value of knowing the spinner count is close to zero, so we should not pay for it.

  • No Sleep in the LIFO semaphore.
    Using spin-Sleep is questionable in a synchronization primitive that can block and ask the OS to wake a thread deterministically.

  • Shorten spinning in the LIFO semaphore to a more affordable value.
    Since the LIFO semaphore can perform a blocking wait until the condition it wants appears, once spinning gets into the range of the wait/wake latency it makes little sense to spin for much longer.
    It is also not uncommon for work to be introduced by non-pool threads, so the pool threads may need to start blocking in order to see more work.
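
The single-request scheme from the second bullet can be sketched roughly as follows. This is an illustrative C# sketch, not the actual runtime code: the type and member names here (WorkSignaler, RequestWorkerFromPool, MarkThreadRequestSatisfied) are made up for the example, and only the flag-based check-exchange pattern reflects the idea in the PR.

```csharp
using System.Threading;

internal sealed class WorkSignaler
{
    // 0 = no outstanding request; 1 = a worker has already been requested.
    private int _hasOutstandingThreadRequest;

    // Called by producers after enqueueing a work item.
    public void EnsureWorkerRequested()
    {
        // Only the thread that flips 0 -> 1 actually wakes a worker,
        // so at most one request is outstanding at a time.
        if (Volatile.Read(ref _hasOutstandingThreadRequest) == 0 &&
            Interlocked.CompareExchange(ref _hasOutstandingThreadRequest, 1, 0) == 0)
        {
            RequestWorkerFromPool();
        }
    }

    // Called by a worker once it has picked up the request and started dispatching.
    public void MarkThreadRequestSatisfied(bool moreWorkObserved)
    {
        Volatile.Write(ref _hasOutstandingThreadRequest, 0);

        // Request another worker only on evidence of more work in the queue;
        // that is the minimum needed for correctness and avoids cascades of
        // spurious wakes when the queue is draining.
        if (moreWorkObserved)
        {
            EnsureWorkerRequested();
        }
    }

    private static void RequestWorkerFromPool()
    {
        // In the real pool this would release a permit on the LIFO semaphore
        // (or queue a native callback); stubbed out for the sketch.
    }
}
```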

@dotnet-policy-service

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

VSadov commented Dec 25, 2025

As a benchmark to measure per-task overhead, I use a subset of the following benchmark: https://github.com/benaadams/ThreadPoolTaskTesting

Results were measured on Windows 11 with an AMD 7950X 16-core CPU.

The following is set to reduce possible noise:
DOTNET_TieredCompilation=0
DOTNET_GCDynamicAdaptationMode=0

Measurements are in tasks per second. Higher is better.

===== Baseline:

Testing 2,621,440 calls, with GCs after 2,621,440 calls.
Operations per second on 32 Cores
                                                                                                                             Parallelism
                                  Serial          2x          4x          8x         16x         32x         64x        128x        512x
QUWI No Queues (TP)              3.016 M     3.122 M     3.027 M     3.048 M     3.037 M     3.014 M     2.992 M     2.997 M     3.003 M
- Depth    2                     3.005 M     2.958 M     3.026 M     3.002 M     2.935 M     2.970 M     2.990 M     2.977 M     3.043 M
- Depth    4                     2.901 M     2.843 M     2.972 M     2.967 M     3.019 M     2.980 M     3.004 M     2.991 M     3.010 M
- Depth    8                     2.965 M     2.919 M     2.902 M     2.734 M     2.902 M     2.922 M     2.913 M     2.927 M     2.937 M
- Depth   16                     2.934 M     2.914 M     2.820 M     2.896 M     2.917 M     2.928 M     2.932 M     2.877 M     2.910 M
- Depth   32                     2.893 M     2.871 M     2.887 M     2.898 M     2.892 M     2.912 M     2.916 M     2.899 M     2.892 M
- Depth   64                     2.894 M     2.867 M     2.888 M     2.881 M     2.877 M     2.887 M     2.879 M     2.883 M     2.915 M
- Depth  128                     2.902 M     2.882 M     2.917 M     2.908 M     2.908 M     2.904 M     2.915 M     2.904 M     2.901 M
- Depth  512                     2.925 M     2.914 M     2.921 M     2.924 M     2.925 M     2.905 M     2.915 M     2.911 M     2.942 M


QUWI Queue Local (TP)            4.799 M     6.223 M    10.593 M    17.118 M    32.025 M    29.635 M    33.384 M    34.646 M    42.084 M
- Depth    2                     6.213 M    10.461 M    14.443 M    21.303 M    32.522 M    38.689 M    39.372 M    40.590 M    43.064 M
- Depth    4                     9.471 M    14.051 M    21.886 M    32.072 M    39.414 M    44.118 M    44.708 M    44.957 M    45.676 M
- Depth    8                    14.232 M    21.544 M    33.413 M    38.438 M    42.951 M    46.537 M    46.760 M    46.946 M    46.824 M
- Depth   16                    21.784 M    33.438 M    37.524 M    41.762 M    45.507 M    47.363 M    47.675 M    47.967 M    48.034 M
- Depth   32                    33.545 M    40.413 M    43.498 M    46.019 M    48.190 M    48.020 M    48.061 M    48.373 M    47.901 M
- Depth   64                    40.034 M    43.087 M    45.389 M    47.332 M    48.465 M    49.146 M    49.223 M    48.879 M    49.355 M
- Depth  128                    42.980 M    46.443 M    47.577 M    48.383 M    48.804 M    48.962 M    48.755 M    49.365 M    49.517 M
- Depth  512                    47.251 M    49.185 M    49.091 M    49.134 M    49.473 M    49.010 M    49.398 M    49.381 M    49.238 M

With this PR:

Testing 2,621,440 calls, with GCs after 2,621,440 calls.
Operations per second on 32 Cores
                                                                                                                             Parallelism
                                  Serial          2x          4x          8x         16x         32x         64x        128x        512x
QUWI No Queues (TP)              3.063 M     3.083 M     3.047 M     3.063 M     3.003 M     3.033 M     3.037 M     3.011 M     3.026 M
- Depth    2                     3.017 M     2.908 M     2.952 M     3.017 M     3.027 M     2.948 M     2.968 M     3.000 M     3.044 M
- Depth    4                     2.925 M     2.977 M     3.024 M     3.008 M     2.979 M     2.980 M     3.004 M     3.022 M     2.991 M
- Depth    8                     2.990 M     2.927 M     2.821 M     2.721 M     2.915 M     2.951 M     2.942 M     2.915 M     2.975 M
- Depth   16                     2.942 M     2.967 M     2.911 M     2.912 M     3.021 M     2.941 M     2.901 M     2.963 M     2.928 M
- Depth   32                     2.961 M     2.944 M     2.948 M     2.944 M     2.986 M     2.954 M     2.937 M     2.939 M     2.936 M
- Depth   64                     2.968 M     2.952 M     2.948 M     2.960 M     2.959 M     2.944 M     2.949 M     2.951 M     2.956 M
- Depth  128                     2.964 M     2.966 M     2.963 M     2.966 M     2.958 M     2.973 M     2.975 M     2.961 M     2.963 M
- Depth  512                     2.990 M     2.968 M     2.971 M     2.988 M     2.963 M     2.979 M     2.977 M     2.984 M     2.984 M


QUWI Queue Local (TP)            5.492 M    10.456 M    17.804 M    18.884 M    48.797 M   127.532 M   162.716 M   158.277 M   214.897 M
- Depth    2                    11.165 M    19.263 M    17.196 M    29.374 M    76.102 M   161.291 M   160.480 M   178.629 M   209.432 M
- Depth    4                    19.652 M    19.699 M    25.000 M    53.820 M   101.990 M   171.157 M   176.565 M   192.625 M   214.042 M
- Depth    8                    23.519 M    25.474 M    37.569 M    91.185 M   136.847 M   157.626 M   183.437 M   199.642 M   204.375 M
- Depth   16                    27.862 M    41.280 M    76.235 M   118.098 M   159.696 M   200.514 M   197.469 M   209.445 M   211.314 M
- Depth   32                    40.314 M    77.313 M   115.200 M   150.681 M   187.060 M   204.254 M   201.597 M   205.071 M   211.376 M
- Depth   64                    73.297 M   139.082 M   172.258 M   176.718 M   199.829 M   218.152 M   205.455 M   205.482 M   213.325 M
- Depth  128                   133.615 M   176.833 M   186.944 M   199.672 M   205.262 M   207.053 M   201.800 M   211.560 M   215.708 M
- Depth  512                   192.360 M   210.508 M   210.339 M   217.688 M   211.859 M   208.360 M   212.702 M   212.503 M   212.348 M

In QUWI Queue Local we are able to execute many more work items per second.

QUWI No Queues is bottlenecked on the FIFO work item queue. This PR does not address that part, so it sees no benefit from concurrency, depth, or anything else.

NOTE: this is a microbenchmark! The tasks are very trivial, on the level of "increment a counter".
The benchmark is a good tool for checking per-task overheads and bottlenecks.
The improvements will vary in real scenarios, where work items do more work than they do in the benchmark.

VSadov commented Dec 25, 2025

For reference, here are the same measurements as above, but with
set DOTNET_ThreadPool_UseWindowsThreadPool=1

In this case the task queue is the same, but thread management is done by the OS.
In particular, there is no Hill Climbing and similar machinery, so per-task expenses are somewhat lower to start with.
(There are downsides to using the OS thread pool, but per-task expenses are lower, at least currently.)

This variant benefits more from the per-work-item improvements in the PR.

===== Baseline:

Testing 2,621,440 calls, with GCs after 2,621,440 calls.
Operations per second on 32 Cores
                                                                                                                             Parallelism
                                  Serial          2x          4x          8x         16x         32x         64x        128x        512x
QUWI No Queues (TP)              3.075 M     3.068 M     3.061 M     3.077 M     3.030 M     3.082 M     3.058 M     3.056 M     3.045 M
- Depth    2                     3.016 M     3.010 M     3.037 M     2.961 M     3.011 M     3.033 M     2.937 M     2.958 M     3.001 M
- Depth    4                     2.945 M     2.997 M     2.983 M     2.987 M     2.990 M     2.975 M     2.990 M     2.965 M     2.942 M
- Depth    8                     2.963 M     2.909 M     2.953 M     2.977 M     2.912 M     2.991 M     2.947 M     3.001 M     2.977 M
- Depth   16                     2.983 M     2.932 M     2.875 M     2.962 M     2.957 M     2.975 M     2.974 M     2.965 M     2.949 M
- Depth   32                     2.963 M     2.961 M     2.962 M     2.945 M     2.939 M     2.963 M     2.952 M     2.955 M     2.958 M
- Depth   64                     2.951 M     2.955 M     2.957 M     2.962 M     2.947 M     2.959 M     2.944 M     2.957 M     2.960 M
- Depth  128                     2.956 M     2.972 M     2.972 M     2.972 M     2.969 M     2.961 M     2.967 M     2.966 M     2.967 M
- Depth  512                     2.964 M     2.968 M     2.966 M     2.965 M     2.964 M     2.963 M     2.974 M     2.968 M     2.971 M


QUWI Queue Local (TP)            7.631 M    15.943 M    22.686 M    30.837 M    35.689 M    45.171 M    50.381 M    52.533 M    55.251 M
- Depth    2                    12.866 M    21.285 M    32.843 M    27.394 M    39.129 M    51.520 M    52.513 M    53.210 M    54.110 M
- Depth    4                    22.034 M    31.817 M    29.279 M    37.944 M    44.660 M    53.304 M    54.043 M    54.486 M    55.104 M
- Depth    8                    36.834 M    36.869 M    40.448 M    44.281 M    50.549 M    55.034 M    55.973 M    55.923 M    55.973 M
- Depth   16                    45.172 M    40.375 M    44.735 M    48.158 M    52.005 M    56.377 M    56.223 M    56.254 M    56.111 M
- Depth   32                    38.088 M    45.396 M    47.299 M    50.493 M    54.404 M    56.791 M    56.596 M    56.820 M    56.028 M
- Depth   64                    44.804 M    48.991 M    51.455 M    53.755 M    55.848 M    56.931 M    57.380 M    56.553 M    56.975 M
- Depth  128                    48.775 M    51.866 M    53.985 M    55.647 M    56.168 M    57.248 M    57.098 M    57.415 M    56.359 M
- Depth  512                    55.442 M    55.050 M    56.782 M    56.497 M    57.615 M    57.055 M    57.175 M    56.792 M    56.774 M

With this PR:

Testing 2,621,440 calls, with GCs after 2,621,440 calls.
Operations per second on 32 Cores
                                                                                                                             Parallelism
                                  Serial          2x          4x          8x         16x         32x         64x        128x        512x
QUWI No Queues (TP)              3.063 M     3.110 M     3.060 M     3.041 M     3.050 M     3.056 M     3.057 M     3.069 M     3.038 M
- Depth    2                     3.047 M     3.026 M     3.013 M     3.029 M     2.968 M     2.969 M     3.021 M     3.003 M     3.006 M
- Depth    4                     2.947 M     3.036 M     3.076 M     3.012 M     3.041 M     3.028 M     3.011 M     3.037 M     3.053 M
- Depth    8                     2.974 M     3.015 M     3.012 M     3.025 M     3.020 M     3.026 M     3.009 M     3.004 M     3.059 M
- Depth   16                     3.014 M     3.031 M     2.959 M     3.043 M     3.017 M     3.012 M     3.015 M     3.021 M     3.034 M
- Depth   32                     3.015 M     2.995 M     3.037 M     3.014 M     3.022 M     2.991 M     3.030 M     2.994 M     3.041 M
- Depth   64                     2.998 M     3.024 M     3.023 M     3.004 M     3.025 M     3.019 M     3.015 M     2.992 M     3.000 M
- Depth  128                     3.008 M     3.003 M     3.005 M     3.006 M     3.010 M     3.006 M     3.000 M     3.004 M     3.004 M
- Depth  512                     3.013 M     3.010 M     3.015 M     3.013 M     3.015 M     3.015 M     3.010 M     3.018 M     3.030 M


QUWI Queue Local (TP)            7.077 M    13.085 M    27.234 M    33.498 M    54.361 M   163.725 M   223.814 M   230.576 M   238.356 M
- Depth    2                    13.710 M    24.174 M    37.391 M    43.846 M    84.419 M   239.729 M   234.524 M   247.397 M   248.473 M
- Depth    4                    22.145 M    34.966 M    52.797 M    62.646 M   126.954 M   193.050 M   222.039 M   255.051 M   247.278 M
- Depth    8                    35.400 M    56.054 M    65.657 M    93.659 M   166.261 M   239.907 M   252.275 M   235.815 M   259.223 M
- Depth   16                    55.728 M    70.509 M   101.264 M   141.210 M   196.271 M   258.734 M   261.945 M   257.276 M   259.935 M
- Depth   32                    70.777 M    81.802 M   133.855 M   175.980 M   233.571 M   248.360 M   256.966 M   261.785 M   263.864 M
- Depth   64                    89.869 M   144.095 M   203.279 M   207.784 M   243.898 M   261.001 M   256.780 M   264.602 M   261.545 M
- Depth  128                   146.725 M   198.235 M   212.273 M   229.452 M   254.920 M   259.824 M   261.772 M   261.219 M   262.685 M
- Depth  512                   239.916 M   253.174 M   258.385 M   258.077 M   262.625 M   265.057 M   264.875 M   264.383 M   262.972 M

@VSadov VSadov marked this pull request as ready for review December 25, 2025 01:26
Copilot AI review requested due to automatic review settings December 25, 2025 01:27

Copilot AI left a comment

Pull request overview

This PR implements performance improvements to the thread pool by reducing per-work-item overhead and minimizing spurious thread wakeups. The primary focus is on optimizing thread scheduling and synchronization while maintaining correctness.

Key changes include:

  • Introducing a single outstanding thread request flag (_hasOutstandingThreadRequest) to replace the counter-based approach, preventing thundering herd issues
  • Reducing memory barriers and fence operations in critical paths where volatile semantics provide sufficient guarantees
  • Simplifying semaphore spinning logic and removing spinner tracking overhead
  • Implementing exponential backoff for contended interlocked operations (a rough sketch of the idea appears after this list)
  • Deferring work queue assignment to reduce lock contention during dispatch startup
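
The backoff item above can be illustrated with a small sketch. This is not the Backoff.cs added by the PR (which, per the file summary below, derives its randomization from a stack address); it is a simplified, hypothetical version showing only the general technique of jittered exponential backoff around a contended CompareExchange.

```csharp
using System;
using System.Threading;

internal static class ContendedCounter
{
    // Increment a shared counter with CAS, backing off exponentially (with
    // jitter) whenever the CAS loses a race to another thread.
    public static int IncrementWithBackoff(ref int location)
    {
        // Cheap per-thread jitter seed; the real code reportedly uses a stack
        // address for this purpose instead.
        uint jitter = (uint)Environment.CurrentManagedThreadId * 2654435769u;
        int attempt = 0;

        while (true)
        {
            int observed = Volatile.Read(ref location);
            if (Interlocked.CompareExchange(ref location, observed + 1, observed) == observed)
            {
                return observed + 1;
            }

            // Lost the race: spin for a randomized, exponentially growing (but
            // capped) number of iterations so contending threads spread out.
            int maxSpins = 1 << Math.Min(attempt, 6);       // 1, 2, 4, ..., 64
            Thread.SpinWait(1 + (int)(jitter % (uint)maxSpins));

            jitter = jitter * 1103515245u + 12345u;         // advance the cheap PRNG
            attempt++;
        }
    }
}
```

In use, such a helper would simply stand in for a plain Interlocked.Increment on a counter that is known to be heavily contended.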

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.

Summary per file:

  WindowsThreadPool.cs: Adds a CacheLineSeparated struct with a thread request flag and implements EnsureWorkerRequested() with a check-exchange pattern to prevent duplicate requests
  ThreadPoolWorkQueue.cs: Removes unnecessary volatile writes in LocalPush, eliminates the _mayHaveHighPriorityWorkItems flag, defers queue assignment, and updates thread request calls
  ThreadPool.Windows.cs: Changes YieldFromDispatchLoop to accept a currentTickCount parameter and renames RequestWorkerThread to EnsureWorkerRequested
  ThreadPool.Unix.cs: Updates the YieldFromDispatchLoop signature to call NotifyDispatchProgress and renames RequestWorkerThread
  ThreadPool.Wasi.cs: Updates the YieldFromDispatchLoop signature with a pragma to suppress the unused parameter warning
  ThreadPool.Browser.cs: Updates the YieldFromDispatchLoop signature with a pragma to suppress the unused parameter warning
  ThreadPool.Browser.Threads.cs: Updates the YieldFromDispatchLoop signature with a pragma to suppress the unused parameter warning
  PortableThreadPool.cs: Renames lastDequeueTime to lastDispatchTime, replaces numRequestedWorkers with _hasOutstandingThreadRequest, and refactors the NotifyWorkItemProgress methods
  PortableThreadPool.WorkerThread.cs: Reduces the semaphore spin count from 70 to 9, refactors WorkerDoWork to use a check-exchange pattern, and implements TryRemoveWorkingWorker with overflow handling
  PortableThreadPool.ThreadCounts.cs: Adds an IsOverflow property and TryIncrement/DecrementProcessingWork methods to manage overflow state using the high bit of _data (see the sketch after this table)
  PortableThreadPool.GateThread.cs: Updates starvation detection to use _hasOutstandingThreadRequest and lastDispatchTime
  PortableThreadPool.Blocking.cs: Updates the blocking adjustment logic to check _hasOutstandingThreadRequest
  LowLevelLifoSemaphore.cs: Removes spinner tracking, changes Release(count) to Signal(), simplifies Wait() to remove the spinWait parameter, and restructures the Counts bit layout
  Backoff.cs: Introduces a new exponential backoff utility using stack-address-based randomization for collision retry scenarios
  System.Private.CoreLib.Shared.projitems: Adds Backoff.cs to the project compilation
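
As an illustration of the overflow-bit idea from the PortableThreadPool.ThreadCounts.cs entry, here is a simplified, hypothetical sketch of a packed count whose high bit latches an overflow state. The real ThreadCounts struct packs several counters and has a different layout; only the high-bit technique is the point here.

```csharp
using System.Threading;

// Simplified stand-in for a packed counts value whose high bit marks an
// "overflow" state; purely illustrative, not the runtime's ThreadCounts.
internal struct PackedCount
{
    private const int OverflowBit = unchecked((int)0x8000_0000);
    private int _data;

    public bool IsOverflow => (_data & OverflowBit) != 0;
    public int Count => _data & ~OverflowBit;

    // Try to increment the count; fails (and latches the overflow bit)
    // once the count reaches the maximum representable value.
    public static bool TryIncrement(ref PackedCount counts)
    {
        while (true)
        {
            int observed = Volatile.Read(ref counts._data);
            if ((observed & OverflowBit) != 0)
            {
                return false; // already overflowed
            }

            int newValue = observed == ~OverflowBit   // count is at max (0x7FFF_FFFF)
                ? observed | OverflowBit              // latch overflow instead of wrapping
                : observed + 1;

            if (Interlocked.CompareExchange(ref counts._data, newValue, observed) == observed)
            {
                return (newValue & OverflowBit) == 0;
            }
        }
    }
}
```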

VSadov and others added 2 commits December 24, 2025 22:29