A research-grade C++ project exploring sequential vs parallel performance across core algorithmic domains
This project is a comprehensive study of parallel algorithm design and performance using modern C++ (C++20) and OpenMP. It benchmarks graph traversal, matrix multiplication, and sorting algorithms, highlighting how algorithmic structure, memory access, and parallelization strategies affect real-world performance.
Modern CPUs derive performance not from higher clock speeds, but from parallelism:
- Multi-core execution
- Cache hierarchies
- Memory locality
This project was built to:
- Understand why parallel code speeds up (or doesn’t)
- Compare sequential vs parallel implementations fairly
- Apply theory → systems-level reality
```
parallel_algorithm/
├── src/
│   ├── graph/
│   │   ├── graph.h
│   │   └── bfs.{h,cpp}
│   ├── matrix/
│   │   ├── matmul.h
│   │   ├── matmul_seq.cpp
│   │   └── matmul_parallel.cpp
│   ├── sort/
│   │   ├── mergesort.h
│   │   ├── mergesort_seq.cpp
│   │   └── mergesort_parallel.cpp
│   ├── utils/
│   │   ├── timer.h
│   │   └── graph_utils.h
│   └── main.cpp
├── CMakeLists.txt
└── README.md
```
- Language: C++20
- Parallelism: OpenMP
- Build System: CMake
- Compiler: GCC / MinGW (OpenMP enabled)
- Platform: Linux / Windows
Goal: Measure the impact of parallelism on irregular workloads.
- Sequential BFS
- Parallel BFS using frontier-based expansion
🔍 Key Insight:
BFS is often memory-bound, so speedup depends heavily on graph structure and cache behavior.
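To make frontier-based expansion concrete, here is a minimal sketch of a level-synchronous parallel BFS: each level's frontier is expanded in parallel, with thread-private next-frontiers merged at the end of the level. The names (`bfs_frontier`, `adj`) are illustrative, not the project's actual API, and the write to `dist` is a benign race (both writers store the same level).

```cpp
#include <cstddef>
#include <vector>

// Frontier-based BFS over an adjacency list. Returns the distance of every
// vertex from `src` (-1 = unreachable). Without OpenMP the pragmas are
// ignored and the code runs as a plain sequential level-order BFS.
std::vector<int> bfs_frontier(const std::vector<std::vector<int>>& adj, int src) {
    std::vector<int> dist(adj.size(), -1);
    std::vector<int> frontier{src};
    dist[src] = 0;
    int level = 0;
    while (!frontier.empty()) {
        std::vector<int> next;
        #pragma omp parallel
        {
            std::vector<int> local;                  // thread-private next frontier
            #pragma omp for nowait
            for (std::size_t i = 0; i < frontier.size(); ++i) {
                for (int v : adj[frontier[i]]) {
                    if (dist[v] == -1) {             // benign race: a vertex may be
                        dist[v] = level + 1;         // claimed twice, but always with
                        local.push_back(v);          // the same (correct) level
                    }
                }
            }
            #pragma omp critical                     // merge per-thread frontiers
            next.insert(next.end(), local.begin(), local.end());
        }
        frontier = std::move(next);
        ++level;
    }
    return dist;
}
```

The per-thread `local` vectors are what keep the hot loop free of synchronization; only the merge at the end of each level is serialized.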
- Sequential naïve multiplication (O(N³))
- Blocked (Cache-Aware) Parallel MatMul
- Loop blocking for L1/L2 cache locality
- Parallelized with `#pragma omp parallel for collapse(2)`
- Avoided false sharing by careful loop ordering
🔍 Why Blocking Matters:
Naïve matrix multiplication causes massive cache misses. Blocking ensures that submatrices stay hot in cache, resulting in:
- 10–100× speedup vs naïve sequential
- Near-linear scaling up to core count
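The blocking idea can be sketched as follows. This is not the project's `matmul_parallel.cpp`, just an illustrative version: `collapse(2)` flattens the two outer tile loops into one parallel iteration space, and the block size of 64 is an assumed default that should be tuned so three tiles fit in L1/L2.

```cpp
#include <algorithm>
#include <vector>

// Cache-blocked C += A * B on n×n row-major matrices stored in flat vectors.
// The i/k/j ordering inside each tile streams through B and C rows, reusing
// one A element across the innermost loop.
void matmul_blocked(const std::vector<double>& A, const std::vector<double>& B,
                    std::vector<double>& C, int n, int bs = 64) {
    #pragma omp parallel for collapse(2)
    for (int ii = 0; ii < n; ii += bs)
        for (int jj = 0; jj < n; jj += bs)
            for (int kk = 0; kk < n; kk += bs)
                for (int i = ii; i < std::min(ii + bs, n); ++i)
                    for (int k = kk; k < std::min(kk + bs, n); ++k) {
                        double a = A[i * n + k];           // reused across the j loop
                        for (int j = jj; j < std::min(jj + bs, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

Each thread owns distinct (ii, jj) tiles, so writes to `C` never share cache lines between threads, which is the loop-ordering point made above.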
- Sequential Merge Sort
- Parallel Merge Sort using OpenMP `sections`
- Depth-limited parallel recursion
- Parallel work only at higher recursion levels
- Sequential fallback to reduce overhead
🔍 Key Insight:
Uncontrolled parallel recursion degrades performance due to thread oversubscription.
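A minimal sketch of the depth-limited pattern (not the project's `mergesort_parallel.cpp`): the two halves run as OpenMP `sections` only while `depth > 0`, and deeper levels fall back to sequential recursion. Choosing a depth near log2(core count) is a common heuristic, assumed here.

```cpp
#include <algorithm>
#include <vector>

// Depth-limited parallel merge sort on a[lo, hi). Once depth reaches 0 the
// recursion is purely sequential, bounding the number of threads spawned.
// Without OpenMP the pragmas are ignored and the sort runs sequentially.
void merge_sort(std::vector<int>& a, int lo, int hi, int depth) {
    if (hi - lo < 2) return;
    int mid = lo + (hi - lo) / 2;
    if (depth > 0) {
        #pragma omp parallel sections
        {
            #pragma omp section
            merge_sort(a, lo, mid, depth - 1);
            #pragma omp section
            merge_sort(a, mid, hi, depth - 1);
        }
    } else {
        merge_sort(a, lo, mid, 0);
        merge_sort(a, mid, hi, 0);
    }
    std::inplace_merge(a.begin() + lo, a.begin() + mid, a.begin() + hi);
}
```

Note that nested parallelism is disabled by default in OpenMP, so inner `parallel sections` regions serialize anyway; the explicit depth limit makes the cutoff deliberate rather than accidental.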
- High-resolution wall-clock timer (`std::chrono`)
- Fixed random seeds for reproducibility
- Identical input data across runs
- Multiple thread counts tested: 1, 2, 4, 8, 12
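The timing approach can be sketched with a few lines of `std::chrono`. This is a guess at the shape of `utils/timer.h`, whose actual interface is not shown here; the key point is using the monotonic `steady_clock` for intervals rather than `system_clock`.

```cpp
#include <chrono>

// Minimal wall-clock timer: starts on construction, reports elapsed
// milliseconds on demand. steady_clock never jumps backwards, so it is
// safe for measuring durations.
struct Timer {
    std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    double elapsed_ms() const {
        return std::chrono::duration<double, std::milli>(
                   std::chrono::steady_clock::now() - start).count();
    }
};
```

Typical usage: construct a `Timer` immediately before the kernel under test, run it, then record `elapsed_ms()`.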
```
Parallel Matrix Multiply:
Threads: 1 | Time: 602112 ms | Speedup: 1.00
Threads: 2 | Time: 315420 ms | Speedup: 1.92
Threads: 4 | Time: 162831 ms | Speedup: 3.73
Threads: 8 | Time: 89120 ms  | Speedup: 6.82

Parallel BFS Traversal:
Threads: 1  | Time: 34.9509 ms
Threads: 2  | Time: 19.7291 ms
Threads: 4  | Time: 12.3914 ms
Threads: 8  | Time: 7.9011 ms
Threads: 12 | Time: 6.9935 ms

Parallel Merge Sort:
Threads: 1 | Time: 682.306 ms | Speedup: 1.26748
Threads: 2 | Time: 932.408 ms | Speedup: 0.9275
Threads: 4 | Time: 601.601 ms | Speedup: 1.43751
Threads: 8 | Time: 621.919 ms | Speedup: 1.39055
```
- Parallelism ≠ automatic speedup
- Memory locality is as important as thread count
- Over-parallelization can degrade performance
- OpenMP is powerful but must be controlled
- No SIMD intrinsics (AVX) yet
- NUMA effects not explicitly handled
- Sorting performance sensitive to input size
```
mkdir build
cd build
cmake ..
cmake --build .
./parallel_algorithm
```

Ensure OpenMP support is enabled in your compiler.
- SIMD-optimized matrix multiplication (AVX2/AVX-512)
- Task-based parallelism (`omp task`)
- NUMA-aware allocation
- CSV benchmark export + plots
This project demonstrates:
- Parallel algorithm design
- Systems-level performance thinking
- Cache-aware programming
- OpenMP mastery
📌 Ideal for:
- Systems / HPC interviews
- Research internships
- Performance engineering roles
MIT License — free to use, modify, and distribute.
"Fast code is not written — it is engineered." 🔥