- Concept: Single-Producer Single-Consumer (SPSC).
- Why: Unlike Multi-Producer queues that require expensive lock instructions or compare_exchange loops, SPSC allows for Wait-Free operations.
- Soution: Using 2 atomic Indexes (write_pos & read_pos), producer modifies write_pos & consumer modifies read_pos, removing contention.
- False Sharing: variables on same 64 bit cache line can cause our threads to flush out there old caches and refill them with updated values. When one core writes, it invalidates the other core's cache.
- Fix: We use
alignas(64)to ensurewrite_posandread_posare on different cache line. - Cache Locality: We sized the array to be
2^14 = 16384 = ~256KB. This ensures that the data stays in L2 Cache on (i3-540) avoiding reaching out to Main Ram.
- Modulo (%): Modulo operation requires integer division, which takes multiple cpu cycles (20-80 cycles).
- Solution: Making the buffer size to be multiple of 2, we can use bitmask with
AND (&)with maskSize - 1. - Performance Gain: Reduces the wrap-around calculation to 1 clock cycle.
- Memory Ordering: Avoiding default std::memory_order_seq_cst (too slow) for Acquire-Release semantics.
- Handshake:
- Producer: Uses
std::memory_order_releasewhen storing write_pos. Ensuring all data writes are visible to consumer before index update. - Consumer: Uses
std::memory_order_acquirewhen loading write_pos. This ensures it "sees" the data the producer just wrote.
- Producer: Uses
- Relaxed Loads: Both Producer and Consumer load their respective atomic variables using
std::memory_order_relaxedas no cross-thread synchronization is required. - Note:
- memory_order_acquire: When you perform a Load with
acquire, you are saying: "I am waiting for a signal. Once I see the value I'm looking for, I must be able to see everything that the 'releasing' thread did before it sent that signal."- Analogy: Opening the suitcase. You are guaranteed to see exactly what was packed inside.
- memory_order_release: When you perform a Store with
release, you are creating a "barrier." You are saying: "Everything I did before this store must be visible to anyone who 'acquires' this variable."- Analogy: Packing a suitcase and then locking it. The "lock" is the release operation.
- memory_order_acquire: When you perform a Load with
- Thread Affinity: Pinning threads to specific physical cores (Core0 and Core1) is vital. Without it the OS schedular moves around the threads, causing cache misses and dropping throughput from 255M to 37M.
- The "3.9 Nanosecond" Win: We achieved a synchronization cost lower than 5ns, meaning the queue is faster than the hardware's ability to even move data to the next cache level.
- Constraint: On systems with limited RAM/Stack (like the 3.75GiB iMac), a 1MB buffer can cause a stack overflow.
- Solution: We used std::unique_ptr with std::make_unique<std::array<T, N>> to keep the large buffer on the Heap while keeping the SPSCQueue object itself tiny and "stack-safe."
- Setup: We used pthread_setaffinity_np to pin the Producer to Core 0 and the Consumer to Core 1.
- Result: ~4.43 ns per operation (225 Million items/sec).
- Insight: When threads are pinned, they stay on their physical cores. The data travels through the L3 Cache (the "Commuter Rail" of the CPU) without the overhead of the OS moving threads around. This is the theoretical limit of the i3-540 hardware.
- Setup: We removed the affinity code and let the Linux Scheduler decide where to run the threads.
- Result: ~26.7 ns per operation (37 Million items/sec).
- Insight: This 6x slowdown occurs because the OS occasionally moves both threads to the same physical core (Hyper-threading contention) or migrates them to different cores, causing L1/L2 Cache Invalidation.
Key Takeaway: Even "lock-free" code can be slow if you don't account for how the OS manages CPU resources.
- Compiler: g++ -O3. Optimization level 3 is mandatory for lock-free code to allow the compiler to reorder non-atomic instructions for maximum pipeline efficiency.
- Memory Check: We used std::is_trivially_copyable to ensure our SyncTask could be moved using simple bit-copies (memcpy style), which is the fastest way to move data in a ring buffer.
- Profiling: We used perf stat on Arch Linux to monitor Instructions Per Cycle (IPC) and Cache-Misses, confirming that our alignas(64) padding successfully reduced cache contention.
- Using the build script:
chmod +x build.sh && ./build.shMeasured using Google Benchmark on Arch Linux (Kernel 6.18):
| Configuration | Latency | Throughput |
|---|---|---|
| Pinned (Core 0/1) | 4.09 ns | 243.9M items/s |
| Unpinned (OS Default) | 26.7 ns | 37.4M items/s |
- Fedor Pikus: "C++ atomics, from basic to advanced. What do they really do?"
- Charles Frasch: "Single Producer Single Consumer Lock-free FIFO From the Ground Up"
Distributed under the MIT License. See LICENSE for more information.