Skip to content

harshal24-chavan/SPSC-Queue-CPP

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

🚀 Technical Brief: High-Performance SPSC Ring Buffer

1. The Core Architecture (SPSC)

  • Concept: Single-Producer Single-Consumer (SPSC).
  • Why: Unlike Multi-Producer queues that require expensive lock instructions or compare_exchange loops, SPSC allows for Wait-Free operations.
  • Soution: Using 2 atomic Indexes (write_pos & read_pos), producer modifies write_pos & consumer modifies read_pos, removing contention.

2. Mechanical Sympathy & CPU Caching

  • False Sharing: variables on same 64 bit cache line can cause our threads to flush out there old caches and refill them with updated values. When one core writes, it invalidates the other core's cache.
  • Fix: We use alignas(64) to ensure write_pos and read_pos are on different cache line.
  • Cache Locality: We sized the array to be 2^14 = 16384 = ~256KB. This ensures that the data stays in L2 Cache on (i3-540) avoiding reaching out to Main Ram.

3. BitMasking

  • Modulo (%): Modulo operation requires integer division, which takes multiple cpu cycles (20-80 cycles).
  • Solution: Making the buffer size to be multiple of 2, we can use bitmask with AND (&) with mask Size - 1.
  • Performance Gain: Reduces the wrap-around calculation to 1 clock cycle.

4. C++ Memory Model (Acquire - Release)

  • Memory Ordering: Avoiding default std::memory_order_seq_cst (too slow) for Acquire-Release semantics.
  • Handshake:
    • Producer: Uses std::memory_order_release when storing write_pos. Ensuring all data writes are visible to consumer before index update.
    • Consumer: Uses std::memory_order_acquire when loading write_pos. This ensures it "sees" the data the producer just wrote.
  • Relaxed Loads: Both Producer and Consumer load their respective atomic variables using std::memory_order_relaxed as no cross-thread synchronization is required.
  • Note:
    • memory_order_acquire: When you perform a Load with acquire, you are saying: "I am waiting for a signal. Once I see the value I'm looking for, I must be able to see everything that the 'releasing' thread did before it sent that signal."
      • Analogy: Opening the suitcase. You are guaranteed to see exactly what was packed inside.
    • memory_order_release: When you perform a Store with release, you are creating a "barrier." You are saying: "Everything I did before this store must be visible to anyone who 'acquires' this variable."
      • Analogy: Packing a suitcase and then locking it. The "lock" is the release operation.

5. Performance Engineering Results

  • Thread Affinity: Pinning threads to specific physical cores (Core0 and Core1) is vital. Without it the OS schedular moves around the threads, causing cache misses and dropping throughput from 255M to 37M.
  • The "3.9 Nanosecond" Win: We achieved a synchronization cost lower than 5ns, meaning the queue is faster than the hardware's ability to even move data to the next cache level.

6. Memory Management (Heap vs. Stack)

  • Constraint: On systems with limited RAM/Stack (like the 3.75GiB iMac), a 1MB buffer can cause a stack overflow.
  • Solution: We used std::unique_ptr with std::make_unique<std::array<T, N>> to keep the large buffer on the Heap while keeping the SPSCQueue object itself tiny and "stack-safe."

7. Performance Benchmarking (The Evidence)

The "Pinned" Benchmark (Mechanical Sympathy)

  • Setup: We used pthread_setaffinity_np to pin the Producer to Core 0 and the Consumer to Core 1.
  • Result: ~4.43 ns per operation (225 Million items/sec).
  • Insight: When threads are pinned, they stay on their physical cores. The data travels through the L3 Cache (the "Commuter Rail" of the CPU) without the overhead of the OS moving threads around. This is the theoretical limit of the i3-540 hardware.

The "Unpinned" Benchmark (The Cost of Abstraction)

  • Setup: We removed the affinity code and let the Linux Scheduler decide where to run the threads.
  • Result: ~26.7 ns per operation (37 Million items/sec).
  • Insight: This 6x slowdown occurs because the OS occasionally moves both threads to the same physical core (Hyper-threading contention) or migrates them to different cores, causing L1/L2 Cache Invalidation.

Key Takeaway: Even "lock-free" code can be slow if you don't account for how the OS manages CPU resources.

8. Tooling & Optimization Flags

  • Compiler: g++ -O3. Optimization level 3 is mandatory for lock-free code to allow the compiler to reorder non-atomic instructions for maximum pipeline efficiency.
  • Memory Check: We used std::is_trivially_copyable to ensure our SyncTask could be moved using simple bit-copies (memcpy style), which is the fastest way to move data in a ring buffer.
  • Profiling: We used perf stat on Arch Linux to monitor Instructions Per Cycle (IPC) and Cache-Misses, confirming that our alignas(64) padding successfully reduced cache contention.

9. How to run:

  • Using the build script:
chmod +x build.sh && ./build.sh

🚀 Performance Benchmarks (Intel i3-540 @ 3.06GHz, 3.75GB Ram)

Measured using Google Benchmark on Arch Linux (Kernel 6.18):

Configuration Latency Throughput
Pinned (Core 0/1) 4.09 ns 243.9M items/s
Unpinned (OS Default) 26.7 ns 37.4M items/s

📚 Credits & Inspiration

This project is a practical implementation of concepts discussed in the following CppCon sessions:

  • Fedor Pikus: "C++ atomics, from basic to advanced. What do they really do?"
  • Charles Frasch: "Single Producer Single Consumer Lock-free FIFO From the Ground Up"

⚖️ License

Distributed under the MIT License. See LICENSE for more information.

About

A wait-free SPSC (Single-Producer Single-Consumer) lock-free queue. Achieving 225M+ ops/s through cache-line alignment and acquire-release semantics.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors