A research-grade C++ project exploring sequential vs parallel performance across core algorithmic domains
This project is a comprehensive study of parallel algorithm design and performance using modern C++ (C++20) and OpenMP. It benchmarks graph traversal, matrix multiplication, and sorting algorithms, highlighting how algorithmic structure, memory access, and parallelization strategies affect real-world performance.
Modern CPUs derive performance not from higher clock speeds, but from parallelism:
- Multi-core execution
- Cache hierarchies
- Memory locality
This project was built to:
- Understand why parallel code speeds up (or doesn’t)
- Compare sequential vs parallel implementations fairly
- Apply theory → systems-level reality
```
parallel_algorithm/
├── src/
│   ├── graph/
│   │   ├── graph.h
│   │   └── bfs.{h,cpp}
│   ├── matrix/
│   │   ├── matmul.h
│   │   ├── matmul_seq.cpp
│   │   └── matmul_parallel.cpp
│   ├── sort/
│   │   ├── mergesort.h
│   │   ├── mergesort_seq.cpp
│   │   └── mergesort_parallel.cpp
│   ├── utils/
│   │   ├── timer.h
│   │   └── graph_utils.h
│   └── main.cpp
├── CMakeLists.txt
└── README.md
```
- Language: C++20
- Parallelism: OpenMP
- Build System: CMake
- Compiler: GCC / MinGW (OpenMP enabled)
- Platform: Linux / Windows
Goal: Measure the impact of parallelism on irregular workloads.
- Sequential BFS
- Parallel BFS using frontier-based expansion
🔍 Key Insight:
BFS is often memory-bound, so speedup depends heavily on graph structure and cache behavior.
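To make frontier-based expansion concrete, here is a minimal sketch of a level-synchronous parallel BFS: each level's frontier is expanded in parallel, with thread-private next-frontiers merged at the end of the level. The names (`bfs_frontier`, `adj`) are illustrative, not the project's actual API, and the write to `dist` is a benign race (both writers store the same level).

```cpp
#include <cstddef>
#include <vector>

// Frontier-based BFS over an adjacency list. Returns the distance of every
// vertex from `src` (-1 = unreachable). Without OpenMP the pragmas are
// ignored and the code runs as a plain sequential level-order BFS.
std::vector<int> bfs_frontier(const std::vector<std::vector<int>>& adj, int src) {
    std::vector<int> dist(adj.size(), -1);
    std::vector<int> frontier{src};
    dist[src] = 0;
    int level = 0;
    while (!frontier.empty()) {
        std::vector<int> next;
        #pragma omp parallel
        {
            std::vector<int> local;                  // thread-private next frontier
            #pragma omp for nowait
            for (std::size_t i = 0; i < frontier.size(); ++i) {
                for (int v : adj[frontier[i]]) {
                    if (dist[v] == -1) {             // benign race: a vertex may be
                        dist[v] = level + 1;         // claimed twice, but always with
                        local.push_back(v);          // the same (correct) level
                    }
                }
            }
            #pragma omp critical                     // merge per-thread frontiers
            next.insert(next.end(), local.begin(), local.end());
        }
        frontier = std::move(next);
        ++level;
    }
    return dist;
}
```

The per-thread `local` vectors are what keep the hot loop free of synchronization; only the merge at the end of each level is serialized.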
- Sequential naïve multiplication (O(N³))
- Blocked (Cache-Aware) Parallel MatMul
- Loop blocking for L1/L2 cache locality
- Parallelized with `#pragma omp parallel for collapse(2)`
- Avoided false sharing by careful loop ordering
🔍 Why Blocking Matters:
Naïve matrix multiplication causes massive cache misses. Blocking ensures that submatrices stay hot in cache, resulting in:
- 10–100× speedup vs naïve sequential
- Near-linear scaling up to core count
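The blocking idea can be sketched as follows. This is not the project's `matmul_parallel.cpp`, just an illustrative version: `collapse(2)` flattens the two outer tile loops into one parallel iteration space, and the block size of 64 is an assumed default that should be tuned so three tiles fit in L1/L2.

```cpp
#include <algorithm>
#include <vector>

// Cache-blocked C += A * B on n×n row-major matrices stored in flat vectors.
// The i/k/j ordering inside each tile streams through B and C rows, reusing
// one A element across the innermost loop.
void matmul_blocked(const std::vector<double>& A, const std::vector<double>& B,
                    std::vector<double>& C, int n, int bs = 64) {
    #pragma omp parallel for collapse(2)
    for (int ii = 0; ii < n; ii += bs)
        for (int jj = 0; jj < n; jj += bs)
            for (int kk = 0; kk < n; kk += bs)
                for (int i = ii; i < std::min(ii + bs, n); ++i)
                    for (int k = kk; k < std::min(kk + bs, n); ++k) {
                        double a = A[i * n + k];           // reused across the j loop
                        for (int j = jj; j < std::min(jj + bs, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

Each thread owns distinct (ii, jj) tiles, so writes to `C` never share cache lines between threads, which is the loop-ordering point made above.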
- Sequential Merge Sort
- Parallel Merge Sort using OpenMP `sections`
- Depth-limited parallel recursion
- Parallel work only at higher recursion levels
- Sequential fallback to reduce overhead
🔍 Key Insight:
Uncontrolled parallel recursion degrades performance due to thread oversubscription.
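A minimal sketch of the depth-limited pattern (not the project's `mergesort_parallel.cpp`): the two halves run as OpenMP `sections` only while `depth > 0`, and deeper levels fall back to sequential recursion. Choosing a depth near log2(core count) is a common heuristic, assumed here.

```cpp
#include <algorithm>
#include <vector>

// Depth-limited parallel merge sort on a[lo, hi). Once depth reaches 0 the
// recursion is purely sequential, bounding the number of threads spawned.
// Without OpenMP the pragmas are ignored and the sort runs sequentially.
void merge_sort(std::vector<int>& a, int lo, int hi, int depth) {
    if (hi - lo < 2) return;
    int mid = lo + (hi - lo) / 2;
    if (depth > 0) {
        #pragma omp parallel sections
        {
            #pragma omp section
            merge_sort(a, lo, mid, depth - 1);
            #pragma omp section
            merge_sort(a, mid, hi, depth - 1);
        }
    } else {
        merge_sort(a, lo, mid, 0);
        merge_sort(a, mid, hi, 0);
    }
    std::inplace_merge(a.begin() + lo, a.begin() + mid, a.begin() + hi);
}
```

Note that nested parallelism is disabled by default in OpenMP, so inner `parallel sections` regions serialize anyway; the explicit depth limit makes the cutoff deliberate rather than accidental.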
- High-resolution wall-clock timer (`std::chrono`)
- Fixed random seeds for reproducibility
- Identical input data across runs
- Multiple thread counts tested: 1, 2, 4, 8, 12
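The timing approach can be sketched with a few lines of `std::chrono`. This is a guess at the shape of `utils/timer.h`, whose actual interface is not shown here; the key point is using the monotonic `steady_clock` for intervals rather than `system_clock`.

```cpp
#include <chrono>

// Minimal wall-clock timer: starts on construction, reports elapsed
// milliseconds on demand. steady_clock never jumps backwards, so it is
// safe for measuring durations.
struct Timer {
    std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    double elapsed_ms() const {
        return std::chrono::duration<double, std::milli>(
                   std::chrono::steady_clock::now() - start).count();
    }
};
```

Typical usage: construct a `Timer` immediately before the kernel under test, run it, then record `elapsed_ms()`.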
```
Parallel Matrix Multiply:
Threads: 1 | Time: 602112 ms | Speedup: 1.00
Threads: 2 | Time: 315420 ms | Speedup: 1.92
Threads: 4 | Time: 162831 ms | Speedup: 3.73
Threads: 8 | Time: 89120 ms  | Speedup: 6.82

Parallel BFS Traversal:
Threads: 1  | Time: 34.9509 ms
Threads: 2  | Time: 19.7291 ms
Threads: 4  | Time: 12.3914 ms
Threads: 8  | Time: 7.9011 ms
Threads: 12 | Time: 6.9935 ms

Parallel Merge Sort:
Threads: 1 | Time: 682.306 ms | Speedup: 1.26748
Threads: 2 | Time: 932.408 ms | Speedup: 0.9275
Threads: 4 | Time: 601.601 ms | Speedup: 1.43751
Threads: 8 | Time: 621.919 ms | Speedup: 1.39055
```
- Parallelism ≠ automatic speedup
- Memory locality is as important as thread count
- Over-parallelization can degrade performance
- OpenMP is powerful but must be controlled
- No SIMD intrinsics (AVX) yet
- NUMA effects not explicitly handled
- Sorting performance sensitive to input size
```
mkdir build
cd build
cmake ..
cmake --build .
./parallel_algorithm
```

Ensure OpenMP support is enabled in your compiler.
- SIMD-optimized matrix multiplication (AVX2/AVX-512)
- Task-based parallelism (`omp task`)
- NUMA-aware allocation
- CSV benchmark export + plots
This project demonstrates:
- Parallel algorithm design
- Systems-level performance thinking
- Cache-aware programming
- OpenMP mastery
📌 Ideal for:
- Systems / HPC interviews
- Research internships
- Performance engineering roles
MIT License — free to use, modify, and distribute.
"Fast code is not written — it is engineered." 🔥