…ning

Performance improvements across scheduler, sync primitives, coroutine infrastructure and I/O backend. All 840 tests pass; ASAN (1896 assertions) and TSAN (1895 assertions) are clean.

## scheduler (runtime/scheduler.hpp)

- Place num_threads_, running_, paused_ on separate cache lines with alignas(64) to eliminate false sharing between cores that frequently read these hot flags.
- Isolate spawn_index_ (written on every spawn()) onto its own cache line so round-robin counter updates do not invalidate the running_/num_threads_ line on other cores.
- Move workers_mutex_ (slow-path resize only) onto its own cache line.

## sync::mutex (sync/primitives.hpp)

- Replace std::mutex + std::queue<coroutine_handle> with a single atomic<void*> that encodes the full lock state as an intrusive LIFO waiter stack. Uncontended lock/unlock is now a single CAS (~3 cycles) with zero heap allocation and no OS mutex involvement.
- Sentinel value locked_no_waiters() == this distinguishes 'locked with no waiters' from any lock_awaitable* stored in the stack.
- Waiters chain themselves lock-free via await_suspend(); unlock() pops the head and schedules it through the coroutine scheduler.

## sync::shared_mutex (sync/primitives.hpp)

- Isolate the hot state_ atomic (read on every lock_shared() fast path) onto its own cache line with alignas(64), separate from the slow-path internal_mutex_ and waiter queues.

## coro::join_state (coro/task.hpp)

- Align waiter_ / completed_ to a new 64-byte cache line, separating the hot synchronisation atomics from the cold value_ / exception_ storage that is written only once.

## chase_lev_deque (runtime/chase_lev_deque.hpp)

- Replace the deprecated (and broken) steal_batch() with a safe implementation using a single CAS on top_ to atomically claim min(available/2, N) slots. Items are loaded only after exclusive ownership is established, eliminating the pop() race. Cuts per-item CAS cost by up to N× when the queue is deep.
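For readers unfamiliar with the state encoding described under sync::mutex above, the following is a minimal sketch of how a single atomic<void*> can hold "unlocked", "locked with no waiters" and "locked with a waiter stack". It is not the code from sync/primitives.hpp. The names waiter, lifo_mutex_sketch and lock_or_enqueue are hypothetical, the coroutine handle and scheduler are left out, and it assumes unlock() hands ownership directly to the popped waiter, which the description does not state.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical waiter node. In the PR this role is played by lock_awaitable,
// which would also carry the suspended coroutine's handle.
struct waiter {
    waiter* next = nullptr;
};

// state_ == nullptr      -> unlocked
// state_ == this         -> locked, no waiters (the sentinel)
// state_ == other value  -> locked, value points at the newest waiter
class lifo_mutex_sketch {
public:
    // Uncontended fast path: one CAS from "unlocked" to the sentinel.
    bool try_lock() noexcept {
        void* expected = nullptr;
        return state_.compare_exchange_strong(expected, locked_no_waiters(),
                                              std::memory_order_acquire,
                                              std::memory_order_relaxed);
    }

    // Returns true if the lock was taken immediately, false if `w` was pushed
    // onto the waiter stack (a coroutine would stay suspended in that case).
    bool lock_or_enqueue(waiter* w) noexcept {
        void* old = state_.load(std::memory_order_relaxed);
        for (;;) {
            if (old == nullptr) {
                if (state_.compare_exchange_weak(old, locked_no_waiters(),
                                                 std::memory_order_acquire,
                                                 std::memory_order_relaxed))
                    return true;
            } else {
                // The sentinel means "empty stack", so it becomes a null link.
                w->next = (old == locked_no_waiters())
                              ? nullptr
                              : static_cast<waiter*>(old);
                if (state_.compare_exchange_weak(old, w,
                                                 std::memory_order_release,
                                                 std::memory_order_relaxed))
                    return false;
            }
        }
    }

    // Called by the current owner. Returns the waiter that now owns the lock
    // (assumed hand-off), or nullptr if the mutex became free.
    waiter* unlock() noexcept {
        void* old = state_.load(std::memory_order_acquire);
        for (;;) {
            assert(old != nullptr && "unlock() called on an unlocked mutex");
            if (old == locked_no_waiters()) {
                if (state_.compare_exchange_weak(old, nullptr,
                                                 std::memory_order_acq_rel,
                                                 std::memory_order_acquire))
                    return nullptr;
            } else {
                auto* head = static_cast<waiter*>(old);
                // Popping the last waiter leaves the mutex locked with none queued.
                void* rest = head->next ? static_cast<void*>(head->next)
                                        : locked_no_waiters();
                if (state_.compare_exchange_weak(old, rest,
                                                 std::memory_order_acq_rel,
                                                 std::memory_order_acquire))
                    return head;
            }
        }
    }

private:
    void* locked_no_waiters() noexcept { return this; }
    std::atomic<void*> state_{nullptr};
};
```

Only the lock owner ever pops, so the pop in unlock() has a single consumer and does not face the usual ABA problem of a multi-consumer stack. Because the stack is LIFO, the most recent waiter is woken first, which favours throughput over fairness.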
## frame_allocator (coro/frame_allocator.hpp)

- Add architecture-aware ELIO_CPU_PAUSE() macro (x86 PAUSE, AArch64 YIELD, fallback std::this_thread::yield()) used in MPSC spin loops.
- Reduce reclaim_remote_returns() spin limit from 100 to 16 iterations and reclaim_all_remote_returns() from 1000 to 32 iterations.
- Fix latent use-after-free hazard: when the node link is still unready after the spin limit, stop iteration rather than consuming the partially-linked node (which the producer would later write through a recycled pointer).

## io_uring_backend (io/io_uring_backend.hpp)

- Increase kResumeTrackingShards from 16 to 64 to accommodate 64-core machines without measurable mutex contention.
- Change config::queue_depth default from 256 to clamp(hw_threads * 512, 1024, 32768), reducing SQE exhaustion under high concurrency while staying memory-efficient on small machines.
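To make the new queue_depth default concrete, here is a small sketch of the formula clamp(hw_threads * 512, 1024, 32768). The helper names default_queue_depth and queue_depth_for_this_machine are hypothetical and do not come from io/io_uring_backend.hpp. The sketch also assumes hw_threads is obtained from std::thread::hardware_concurrency(), which the description does not confirm.

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>

// Constants taken from the formula in the description.
inline constexpr std::uint64_t kDepthPerThread = 512;
inline constexpr std::uint64_t kMinQueueDepth = 1024;
inline constexpr std::uint64_t kMaxQueueDepth = 32768;

// Hypothetical helper: computes the default depth for a given thread count.
constexpr std::uint32_t default_queue_depth(std::uint32_t hw_threads) noexcept {
    // Multiply in 64 bits so very large thread counts cannot overflow.
    const std::uint64_t wanted = std::uint64_t{hw_threads} * kDepthPerThread;
    return static_cast<std::uint32_t>(
        std::clamp(wanted, kMinQueueDepth, kMaxQueueDepth));
}

// Assumption: the thread count comes from hardware_concurrency(), which may
// return 0 when the value is not computable; the lower clamp covers that case.
inline std::uint32_t queue_depth_for_this_machine() noexcept {
    return default_queue_depth(std::thread::hardware_concurrency());
}

static_assert(default_queue_depth(4) == 2048, "4 threads scale linearly");
static_assert(default_queue_depth(64) == 32768, "64 threads hit the upper bound");
```

Under this formula, machines with one or two hardware threads get the 1024 floor, and 64 or more threads reach the 32768 ceiling. For thread counts that are not a power of two (for example 6, giving 3072), the kernel is expected to round the ring size up to the next power of two.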