[ThreadPool] Scalability Experiments follow up. Part1 #122726
Conversation
Tagging subscribers to this area: @mangod9
As a benchmark to measure per-task overhead I use a subset of the following benchmark: https://github.com/benaadams/ThreadPoolTaskTesting

Results as measured on Win11 with an AMD 7950X 16-Core. The following is set to reduce possible noise:

Measurement is in number of tasks per second. Higher is better.

Baseline:

With this PR:
NOTE: this is a microbenchmark! These tasks are very trivial, on the level of "increment a counter".
For reference - the same as above; in this case the task queue is the same, but the thread management is done by the OS. This variant benefits more from the per-workitem improvements in the PR.

Baseline:

With this PR:
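For context, a minimal tasks-per-second microbenchmark in this spirit (queue a batch of trivial work items and time their completion) could look roughly like the sketch below. This is not the actual ThreadPoolTaskTesting harness; the batch size and reporting here are purely illustrative.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

class TasksPerSecond
{
    private static long s_completed;

    static async Task Main()
    {
        const int batch = 1_000_000;
        var sw = Stopwatch.StartNew();

        // Queue a large batch of trivial work items ("increment a counter" level of work)
        // and wait for all of them to complete.
        var tasks = new Task[batch];
        for (int i = 0; i < batch; i++)
        {
            tasks[i] = Task.Run(() => Interlocked.Increment(ref s_completed));
        }
        await Task.WhenAll(tasks);

        sw.Stop();
        Console.WriteLine($"{s_completed / sw.Elapsed.TotalSeconds:N0} tasks/sec");
    }
}
```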
Pull request overview
This PR implements performance improvements to the thread pool by reducing per-work-item overhead and minimizing spurious thread wakeups. The primary focus is on optimizing thread scheduling and synchronization while maintaining correctness.
Key changes include:
- Introducing a single outstanding thread request flag (`_hasOutstandingThreadRequest`) to replace the counter-based approach, preventing thundering-herd issues (see the sketch after this list)
- Reducing memory barriers and fence operations in critical paths where volatile semantics provide sufficient guarantees
- Simplifying semaphore spinning logic and removing spinner tracking overhead
- Implementing exponential backoff for contended interlocked operations
- Deferring work queue assignment to reduce lock contention during dispatch startup
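As a rough illustration of the single-outstanding-request idea (the flag name follows the description above, but the surrounding class is a simplified stand-in, not the actual PortableThreadPool/WindowsThreadPool code):

```csharp
using System.Threading;

internal sealed class ThreadRequestExample
{
    // 0 = no outstanding request, 1 = a worker thread has already been requested.
    private int _hasOutstandingThreadRequest;

    // Request at most one worker thread at a time. The cheap volatile read filters out
    // the common already-requested case; the CompareExchange ensures only one caller wins.
    public void EnsureWorkerRequested()
    {
        if (Volatile.Read(ref _hasOutstandingThreadRequest) == 0 &&
            Interlocked.CompareExchange(ref _hasOutstandingThreadRequest, 1, 0) == 0)
        {
            RequestWorkerThread();
        }
    }

    // Called once a worker has picked up the request; a new request can then be issued
    // if there is evidence of more work in the queue.
    public void MarkThreadRequestSatisfied()
    {
        Volatile.Write(ref _hasOutstandingThreadRequest, 0);
    }

    private void RequestWorkerThread()
    {
        // In the real thread pool this would wake or create a worker; here it is a stub.
    }
}
```

The volatile read keeps the hot path cheap, and the CompareExchange guarantees that at most one thread request is outstanding until a worker consumes it, which is what prevents the thundering herd.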
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| WindowsThreadPool.cs | Adds CacheLineSeparated struct with thread request flag and implements EnsureWorkerRequested() with check-exchange pattern to prevent duplicate requests |
| ThreadPoolWorkQueue.cs | Removes unnecessary volatile writes in LocalPush, eliminates _mayHaveHighPriorityWorkItems flag, defers queue assignment, and updates thread request calls |
| ThreadPool.Windows.cs | Changes YieldFromDispatchLoop to accept currentTickCount parameter and renames RequestWorkerThread to EnsureWorkerRequested |
| ThreadPool.Unix.cs | Updates YieldFromDispatchLoop signature to call NotifyDispatchProgress and renames RequestWorkerThread |
| ThreadPool.Wasi.cs | Updates YieldFromDispatchLoop signature with pragma to suppress unused parameter warning |
| ThreadPool.Browser.cs | Updates YieldFromDispatchLoop signature with pragma to suppress unused parameter warning |
| ThreadPool.Browser.Threads.cs | Updates YieldFromDispatchLoop signature with pragma to suppress unused parameter warning |
| PortableThreadPool.cs | Renames lastDequeueTime to lastDispatchTime, replaces numRequestedWorkers with _hasOutstandingThreadRequest, refactors NotifyWorkItemProgress methods |
| PortableThreadPool.WorkerThread.cs | Reduces semaphore spin count from 70 to 9, refactors WorkerDoWork to use check-exchange pattern, implements TryRemoveWorkingWorker with overflow handling |
| PortableThreadPool.ThreadCounts.cs | Adds IsOverflow property and TryIncrement/DecrementProcessingWork methods to manage overflow state using high bit of _data |
| PortableThreadPool.GateThread.cs | Updates starvation detection to use _hasOutstandingThreadRequest and lastDispatchTime |
| PortableThreadPool.Blocking.cs | Updates blocking adjustment logic to check _hasOutstandingThreadRequest |
| LowLevelLifoSemaphore.cs | Removes spinner tracking, changes Release(count) to Signal(), simplifies Wait() to remove spinWait parameter, restructures Counts bit layout |
| Backoff.cs | Introduces new exponential backoff utility using stack address-based randomization for collision retry scenarios |
| System.Private.CoreLib.Shared.projitems | Adds Backoff.cs to project compilation |
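As a hedged sketch of the stack-address-based backoff idea described for Backoff.cs (the actual helper in the PR may differ in shape and tuning; the modulus, exponent cap, and spin counts here are illustrative, and the snippet requires AllowUnsafeBlocks):

```csharp
using System;
using System.Threading;

// Illustrative only; building this requires <AllowUnsafeBlocks>true</AllowUnsafeBlocks>.
internal static class BackoffExample
{
    // Spin for an exponentially growing, slightly randomized number of iterations.
    // The address of a stack local serves as a cheap per-thread/per-call randomizer,
    // de-correlating contending threads without touching any shared state.
    public static void CollisionBackoff(int collisions)
    {
        unsafe
        {
            int local = 0;
            uint seed = (uint)(nuint)&local;

            // Cap the exponent so the spin does not grow without bound.
            int exponent = Math.Min(collisions, 10);
            uint spins = ((seed % 13) + 1) << exponent;

            Thread.SpinWait((int)spins);
        }
    }
}
```

A caller retrying a contended interlocked operation would invoke this after each failed attempt, passing the number of collisions observed so far.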
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This is a follow-up on recommendations from the Scalability Experiments done some time ago.
The Scalability Experiments resulted in many suggestions. In this part we look at the overheads of submitting and executing a workitem on the threadpool from the thread-scheduling point of view. In particular, this PR tries to minimize changes to the workqueue to keep the scope contained.
The workqueue-related recommendations will be addressed separately.
The threadpool parts are very interconnected, though, and removing one bottleneck sometimes causes another one to show up, so some workqueue changes had to be made just to avoid regressions.
There are also a few "low-hanging fruit" fixes for per-workitem overheads, such as unnecessary fences or too-frequent modifications of shared state.
Hopefully this will negate some of the regressions from #121887 (as was reported in #122186)
In this change:
- Fewer operations per work item where possible.
  - Such as fewer/weaker fences where possible, reporting the heartbeat once per dispatch quantum vs. per each workitem, etc.
- Avoid spurious wakes of worker threads (except, unavoidably, when the thread goal is changed - by HillClimbing and such).
  - Only one thread is requested at a time. Requesting another thread is conditioned on evidence of work present in the queue (basically the minimum required for correctness).
  - As a result, a thread that becomes active typically finds work.
  - In particular, this avoids a cascade of spurious wakes when the pool is running out of workitems.
- Stop tracking spinners in the LIFO semaphore.
  - We could keep track of spinners, but the informational value of knowing the spinner count is close to zero, so we should not.
- No Sleep in the LIFO semaphore.
  - Using spin-Sleep is questionable in a synchronization primitive that can block and ask the OS to wake a thread deterministically.
- Shorten spinning in the LIFO semaphore to a more affordable value (see the spin-then-block sketch after this list).
  - Since the LIFO semaphore can perform a blocking wait until the condition it wants to see is met, once spinning gets into the range of wait/wake latency it makes no sense to spin for much longer.
  - It is also not uncommon that work is introduced by non-pool threads, so the pool threads may need to start blocking in order to see more work.
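As a rough sketch of the spin-then-block shape described above (built on SemaphoreSlim rather than the runtime's LowLevelLifoSemaphore; the spin count of 9 mirrors the value mentioned in the file summary, but the overall structure is illustrative):

```csharp
using System.Threading;

internal sealed class SpinThenBlockWaiter
{
    private readonly SemaphoreSlim _semaphore = new SemaphoreSlim(0);

    // Keep spinning short: once spinning costs about as much as a blocking
    // wait/wake round trip, it is cheaper to just block and let the OS wake us.
    private const int SpinCount = 9;

    public void Signal() => _semaphore.Release();

    public void Wait()
    {
        for (int i = 0; i < SpinCount; i++)
        {
            // Poll without blocking; back off briefly between attempts.
            if (_semaphore.Wait(0))
            {
                return;
            }
            Thread.SpinWait(1 << i);
        }

        // No signal arrived while spinning; fall back to a blocking wait.
        _semaphore.Wait();
    }
}
```

Once the spin loop has cost roughly as much as a wait/wake round trip, the waiter simply blocks, so a signal from a non-pool thread still wakes it deterministically without burning CPU.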