
perf: cache-line alignment, lock-free mutex, batch steal, io_uring tu… #34

Merged

Coldwings merged 1 commit into main from perf/cache-line-alignment-io-uring-tuning on Mar 11, 2026

Conversation

@Coldwings
Owner

…ning

Performance improvements across scheduler, sync primitives, coroutine infrastructure and I/O backend. All 840 tests pass; ASAN (1896 assertions) and TSAN (1895 assertions) are clean.

scheduler (runtime/scheduler.hpp)

  • Place num_threads_, running_, paused_ on separate cache lines with alignas(64) to eliminate false sharing between cores that frequently read these hot flags.
  • Isolate spawn_index_ (written on every spawn()) onto its own cache line so round-robin counter updates do not invalidate the running_/num_threads_ line on other cores.
  • Move workers_mutex_ (slow-path resize only) onto its own cache line.
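As an illustrative sketch only (not the actual layout in scheduler.hpp), the idea is that each hot field occupies its own 64-byte cache line, so a write to one flag cannot invalidate the line holding the others on another core:

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical layout demonstrating the alignas(64) separation.
// Field names follow the PR description; the real struct differs.
struct scheduler_hot_state {
    alignas(64) std::atomic<std::size_t> num_threads_{0};
    alignas(64) std::atomic<bool>        running_{false};
    alignas(64) std::atomic<bool>        paused_{false};
    // Written on every spawn(); isolating it keeps the round-robin
    // counter from dirtying the lines holding the flags above.
    alignas(64) std::atomic<std::size_t> spawn_index_{0};
};

static_assert(alignof(scheduler_hot_state) == 64);
```

The same pattern applies to the shared_mutex state_ and join_state changes below: hot atomics get a dedicated line, cold slow-path members live elsewhere.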

sync::mutex (sync/primitives.hpp)

  • Replace std::mutex + std::queue<coroutine_handle> with a single atomic<void*> that encodes the full lock state as an intrusive LIFO waiter stack. Uncontended lock/unlock is now a single CAS (~3 cycles) with zero heap allocation and no OS mutex involvement.
  • Sentinel value locked_no_waiters() == this distinguishes 'locked with no waiters' from any lock_awaitable* stored in the stack.
  • Waiters chain themselves lock-free via await_suspend(); unlock() pops the head and schedules it through the coroutine scheduler.
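A minimal sketch of the state encoding, assuming the names from this description (tiny_mutex, waiter, and push_waiter are illustrative stand-ins for the real lock_awaitable machinery, which suspends and resumes coroutines rather than returning):

```cpp
#include <atomic>

// state_ holds one of three values:
//   nullptr            -> unlocked
//   this (sentinel)    -> locked, no waiters
//   waiter* (non-this) -> locked, intrusive LIFO stack of waiters
struct waiter { waiter* next = nullptr; };

class tiny_mutex {
public:
    // Fast path: a single CAS from nullptr to the sentinel.
    bool try_lock() {
        void* expected = nullptr;
        return state_.compare_exchange_strong(expected, locked_no_waiters(),
                                              std::memory_order_acquire);
    }
    // Contended path (await_suspend analogue): push ourselves onto the
    // LIFO stack lock-free. Returns false if the mutex turned out to be
    // free and we acquired it instead (no suspension needed).
    bool push_waiter(waiter* w) {
        void* cur = state_.load(std::memory_order_relaxed);
        for (;;) {
            if (cur == nullptr) {
                if (state_.compare_exchange_weak(cur, locked_no_waiters(),
                                                 std::memory_order_acquire))
                    return false;                // acquired directly
                continue;
            }
            w->next = (cur == locked_no_waiters())
                          ? nullptr : static_cast<waiter*>(cur);
            if (state_.compare_exchange_weak(cur, w,
                                             std::memory_order_release))
                return true;                     // parked on the stack
        }
    }
    // unlock(): pop the head waiter (ownership transfers to it), or
    // clear the sentinel if nobody is waiting.
    waiter* unlock() {
        void* cur = state_.load(std::memory_order_relaxed);
        for (;;) {
            if (cur == locked_no_waiters()) {
                if (state_.compare_exchange_weak(cur, nullptr,
                                                 std::memory_order_release))
                    return nullptr;              // nobody to wake
            } else {
                waiter* head = static_cast<waiter*>(cur);
                void* rest = head->next ? static_cast<void*>(head->next)
                                        : locked_no_waiters();
                if (state_.compare_exchange_weak(cur, rest,
                                                 std::memory_order_acq_rel))
                    return head;                 // caller schedules this one
            }
        }
    }
private:
    void* locked_no_waiters() { return this; }   // sentinel, as in the PR
    std::atomic<void*> state_{nullptr};
};
```

The sentinel works because `this` can never collide with the address of a waiter stored in the stack.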

sync::shared_mutex (sync/primitives.hpp)

  • Isolate the hot state_ atomic (read on every lock_shared() fast path) onto its own cache line with alignas(64), separate from the slow-path internal_mutex_ and waiter queues.

coro::join_state (coro/task.hpp)

  • Align waiter_ / completed_ to a new 64-byte cache line, separating the hot synchronisation atomics from the cold value_ / exception_ storage that is written only once.

chase_lev_deque (runtime/chase_lev_deque.hpp)

  • Replace the deprecated (and broken) steal_batch() with a safe implementation using a single CAS on top_ to atomically claim min(available/2, N) slots. Items are loaded only after exclusive ownership is established, eliminating the pop() race. Cuts per-item CAS cost by up to N× when the queue is deep.
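A self-contained sketch of the claim-then-read idea (mini_deque and its members are illustrative, not the library's chase_lev_deque; resize and the owner-side pop race handling are omitted):

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>

struct mini_deque {
    static constexpr std::size_t kCap = 1024;    // power of two
    std::atomic<long> top_{0}, bottom_{0};
    int buffer_[kCap];

    void push(int v) {                           // owner side
        long b = bottom_.load(std::memory_order_relaxed);
        buffer_[b & (kCap - 1)] = v;
        bottom_.store(b + 1, std::memory_order_release);
    }

    // Thief side: one CAS on top_ claims about half the queue
    // (at least one item), capped at max_n. Items are read only
    // after the CAS succeeds, so thieves never race pop().
    std::size_t steal_batch(int* out, std::size_t max_n) {
        long t = top_.load(std::memory_order_acquire);
        long b = bottom_.load(std::memory_order_acquire);
        long avail = b - t;
        if (avail <= 0) return 0;
        long n = std::min(std::max(1L, avail / 2), (long)max_n);
        if (!top_.compare_exchange_strong(t, t + n,
                std::memory_order_seq_cst, std::memory_order_relaxed))
            return 0;                            // lost the race; retry later
        for (long i = 0; i < n; ++i)             // slots [t, t+n) are ours
            out[i] = buffer_[(t + i) & (kCap - 1)];
        return (std::size_t)n;
    }
};
```

Compared with stealing items one CAS at a time, claiming N slots with one CAS is what cuts the per-item synchronization cost by up to N×.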

frame_allocator (coro/frame_allocator.hpp)

  • Add architecture-aware ELIO_CPU_PAUSE() macro (x86 PAUSE, AArch64 YIELD, fallback std::this_thread::yield()) used in MPSC spin loops.
  • Reduce reclaim_remote_returns() spin limit from 100 to 16 iterations and reclaim_all_remote_returns() from 1000 to 32 iterations.
  • Fix latent use-after-free hazard: when the node link is still unready after the spin limit, stop iteration rather than consuming the partially-linked node (which the producer would later write through a recycled pointer).
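A hedged sketch of what an architecture-aware pause macro and a bounded spin loop can look like (the exact preprocessor guards and spin structure in the tree may differ; spin_until is an illustrative helper, not a function from the PR):

```cpp
#include <thread>

#if defined(__x86_64__) || defined(__i386__)
  #include <immintrin.h>
  #define ELIO_CPU_PAUSE() _mm_pause()                       // x86 PAUSE hint
#elif defined(__aarch64__)
  #define ELIO_CPU_PAUSE() __asm__ __volatile__("yield" ::: "memory")
#else
  #define ELIO_CPU_PAUSE() std::this_thread::yield()         // portable fallback
#endif

// Bounded spin in the spirit of reclaim_remote_returns(): pause between
// probes and give up after a small fixed limit. On failure the caller
// stops iterating instead of consuming a partially-linked node, which
// is the use-after-free fix described above.
template <typename Pred>
bool spin_until(Pred ready, int limit = 16) {
    for (int i = 0; i < limit; ++i) {
        if (ready()) return true;
        ELIO_CPU_PAUSE();
    }
    return false;
}
```

PAUSE/YIELD tell the core it is in a spin-wait, reducing pipeline flushes and power draw versus a tight load loop.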

io_uring_backend (io/io_uring_backend.hpp)

  • Increase kResumeTrackingShards from 16 to 64 to accommodate 64-core machines without measurable mutex contention.
  • Change config::queue_depth default from 256 to clamp(hw_threads * 512, 1024, 32768), reducing SQE exhaustion under high concurrency while staying memory-efficient on small machines.
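The new default amounts to the following computation (sketch only; the function name and exact wiring into config are assumptions):

```cpp
#include <algorithm>
#include <thread>

// Default io_uring submission-queue depth per the formula above:
// clamp(hw_threads * 512, 1024, 32768).
unsigned default_queue_depth() {
    // hardware_concurrency() may return 0 when unknown; treat as 1.
    unsigned hw = std::max(1u, std::thread::hardware_concurrency());
    return std::clamp(hw * 512u, 1024u, 32768u);
}
```

A 2-core box gets the 1024 floor instead of paying for a huge ring, while a 64-core machine reaches the 32768 ceiling rather than exhausting SQEs at the old 256 default.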

Coldwings merged commit 6f38b35 into main on Mar 11, 2026
5 checks passed