
perf: cache-line alignment, lock-free mutex, batch steal, io_uring tu… #34

Merged

Coldwings merged 1 commit into main from perf/cache-line-alignment-io-uring-tuning on Mar 11, 2026

Conversation

@Coldwings
Owner

…ning

Performance improvements across scheduler, sync primitives, coroutine infrastructure and I/O backend. All 840 tests pass; ASAN (1896 assertions) and TSAN (1895 assertions) are clean.

scheduler (runtime/scheduler.hpp)

  • Place num_threads_, running_, paused_ on separate cache lines with alignas(64) to eliminate false sharing between cores that frequently read these hot flags.
  • Isolate spawn_index_ (written on every spawn()) onto its own cache line so round-robin counter updates do not invalidate the running_/num_threads_ line on other cores.
  • Move workers_mutex_ (slow-path resize only) onto its own cache line.
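As an illustrative sketch only (not the actual layout in scheduler.hpp), the idea is that each hot field occupies its own 64-byte cache line, so a write to one flag cannot invalidate the line holding the others on another core:

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical layout demonstrating the alignas(64) separation.
// Field names follow the PR description; the real struct differs.
struct scheduler_hot_state {
    alignas(64) std::atomic<std::size_t> num_threads_{0};
    alignas(64) std::atomic<bool>        running_{false};
    alignas(64) std::atomic<bool>        paused_{false};
    // Written on every spawn(); isolating it keeps the round-robin
    // counter from dirtying the lines holding the flags above.
    alignas(64) std::atomic<std::size_t> spawn_index_{0};
};

static_assert(alignof(scheduler_hot_state) == 64);
```

The same pattern applies to the shared_mutex state_ and join_state changes below: hot atomics get a dedicated line, cold slow-path members live elsewhere.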

sync::mutex (sync/primitives.hpp)

  • Replace std::mutex + std::queue<coroutine_handle> with a single atomic<void*> that encodes the full lock state as an intrusive LIFO waiter stack. Uncontended lock/unlock is now a single CAS (~3 cycles) with zero heap allocation and no OS mutex involvement.
  • Sentinel value locked_no_waiters() == this distinguishes 'locked with no waiters' from any lock_awaitable* stored in the stack.
  • Waiters chain themselves lock-free via await_suspend(); unlock() pops the head and schedules it through the coroutine scheduler.
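A minimal sketch of the state encoding, assuming the names from this description (tiny_mutex, waiter, and push_waiter are illustrative stand-ins for the real lock_awaitable machinery, which suspends and resumes coroutines rather than returning):

```cpp
#include <atomic>

// state_ holds one of three values:
//   nullptr            -> unlocked
//   this (sentinel)    -> locked, no waiters
//   waiter* (non-this) -> locked, intrusive LIFO stack of waiters
struct waiter { waiter* next = nullptr; };

class tiny_mutex {
public:
    // Fast path: a single CAS from nullptr to the sentinel.
    bool try_lock() {
        void* expected = nullptr;
        return state_.compare_exchange_strong(expected, locked_no_waiters(),
                                              std::memory_order_acquire);
    }
    // Contended path (await_suspend analogue): push ourselves onto the
    // LIFO stack lock-free. Returns false if the mutex turned out to be
    // free and we acquired it instead (no suspension needed).
    bool push_waiter(waiter* w) {
        void* cur = state_.load(std::memory_order_relaxed);
        for (;;) {
            if (cur == nullptr) {
                if (state_.compare_exchange_weak(cur, locked_no_waiters(),
                                                 std::memory_order_acquire))
                    return false;                // acquired directly
                continue;
            }
            w->next = (cur == locked_no_waiters())
                          ? nullptr : static_cast<waiter*>(cur);
            if (state_.compare_exchange_weak(cur, w,
                                             std::memory_order_release))
                return true;                     // parked on the stack
        }
    }
    // unlock(): pop the head waiter (ownership transfers to it), or
    // clear the sentinel if nobody is waiting.
    waiter* unlock() {
        void* cur = state_.load(std::memory_order_relaxed);
        for (;;) {
            if (cur == locked_no_waiters()) {
                if (state_.compare_exchange_weak(cur, nullptr,
                                                 std::memory_order_release))
                    return nullptr;              // nobody to wake
            } else {
                waiter* head = static_cast<waiter*>(cur);
                void* rest = head->next ? static_cast<void*>(head->next)
                                        : locked_no_waiters();
                if (state_.compare_exchange_weak(cur, rest,
                                                 std::memory_order_acq_rel))
                    return head;                 // caller schedules this one
            }
        }
    }
private:
    void* locked_no_waiters() { return this; }   // sentinel, as in the PR
    std::atomic<void*> state_{nullptr};
};
```

The sentinel works because `this` can never collide with the address of a waiter stored in the stack.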

sync::shared_mutex (sync/primitives.hpp)

  • Isolate the hot state_ atomic (read on every lock_shared() fast path) onto its own cache line with alignas(64), separate from the slow-path internal_mutex_ and waiter queues.

coro::join_state (coro/task.hpp)

  • Align waiter_ / completed_ to a new 64-byte cache line, separating the hot synchronisation atomics from the cold value_ / exception_ storage that is written only once.

chase_lev_deque (runtime/chase_lev_deque.hpp)

  • Replace the deprecated (and broken) steal_batch() with a safe implementation using a single CAS on top_ to atomically claim min(available/2, N) slots. Items are loaded only after exclusive ownership is established, eliminating the pop() race. Cuts per-item CAS cost by up to N× when the queue is deep.
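A self-contained sketch of the claim-then-read idea (mini_deque and its members are illustrative, not the library's chase_lev_deque; resize and the owner-side pop race handling are omitted):

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>

struct mini_deque {
    static constexpr std::size_t kCap = 1024;    // power of two
    std::atomic<long> top_{0}, bottom_{0};
    int buffer_[kCap];

    void push(int v) {                           // owner side
        long b = bottom_.load(std::memory_order_relaxed);
        buffer_[b & (kCap - 1)] = v;
        bottom_.store(b + 1, std::memory_order_release);
    }

    // Thief side: one CAS on top_ claims about half the queue
    // (at least one item), capped at max_n. Items are read only
    // after the CAS succeeds, so thieves never race pop().
    std::size_t steal_batch(int* out, std::size_t max_n) {
        long t = top_.load(std::memory_order_acquire);
        long b = bottom_.load(std::memory_order_acquire);
        long avail = b - t;
        if (avail <= 0) return 0;
        long n = std::min(std::max(1L, avail / 2), (long)max_n);
        if (!top_.compare_exchange_strong(t, t + n,
                std::memory_order_seq_cst, std::memory_order_relaxed))
            return 0;                            // lost the race; retry later
        for (long i = 0; i < n; ++i)             // slots [t, t+n) are ours
            out[i] = buffer_[(t + i) & (kCap - 1)];
        return (std::size_t)n;
    }
};
```

Compared with stealing items one CAS at a time, claiming N slots with one CAS is what cuts the per-item synchronization cost by up to N×.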

frame_allocator (coro/frame_allocator.hpp)

  • Add architecture-aware ELIO_CPU_PAUSE() macro (x86 PAUSE, AArch64 YIELD, fallback std::this_thread::yield()) used in MPSC spin loops.
  • Reduce reclaim_remote_returns() spin limit from 100 to 16 iterations and reclaim_all_remote_returns() from 1000 to 32 iterations.
  • Fix latent use-after-free hazard: when the node link is still unready after the spin limit, stop iteration rather than consuming the partially-linked node (which the producer would later write through a recycled pointer).
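A hedged sketch of what an architecture-aware pause macro and a bounded spin loop can look like (the exact preprocessor guards and spin structure in the tree may differ; spin_until is an illustrative helper, not a function from the PR):

```cpp
#include <thread>

#if defined(__x86_64__) || defined(__i386__)
  #include <immintrin.h>
  #define ELIO_CPU_PAUSE() _mm_pause()                       // x86 PAUSE hint
#elif defined(__aarch64__)
  #define ELIO_CPU_PAUSE() __asm__ __volatile__("yield" ::: "memory")
#else
  #define ELIO_CPU_PAUSE() std::this_thread::yield()         // portable fallback
#endif

// Bounded spin in the spirit of reclaim_remote_returns(): pause between
// probes and give up after a small fixed limit. On failure the caller
// stops iterating instead of consuming a partially-linked node, which
// is the use-after-free fix described above.
template <typename Pred>
bool spin_until(Pred ready, int limit = 16) {
    for (int i = 0; i < limit; ++i) {
        if (ready()) return true;
        ELIO_CPU_PAUSE();
    }
    return false;
}
```

PAUSE/YIELD tell the core it is in a spin-wait, reducing pipeline flushes and power draw versus a tight load loop.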

io_uring_backend (io/io_uring_backend.hpp)

  • Increase kResumeTrackingShards from 16 to 64 to accommodate 64-core machines without measurable mutex contention.
  • Change config::queue_depth default from 256 to clamp(hw_threads * 512, 1024, 32768), reducing SQE exhaustion under high concurrency while staying memory-efficient on small machines.
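The new default amounts to the following computation (sketch only; the function name and exact wiring into config are assumptions):

```cpp
#include <algorithm>
#include <thread>

// Default io_uring submission-queue depth per the formula above:
// clamp(hw_threads * 512, 1024, 32768).
unsigned default_queue_depth() {
    // hardware_concurrency() may return 0 when unknown; treat as 1.
    unsigned hw = std::max(1u, std::thread::hardware_concurrency());
    return std::clamp(hw * 512u, 1024u, 32768u);
}
```

A 2-core box gets the 1024 floor instead of paying for a huge ring, while a 64-core machine reaches the 32768 ceiling rather than exhausting SQEs at the old 256 default.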

Coldwings merged commit 6f38b35 into main on Mar 11, 2026
5 checks passed