animesh-94/parallel_algorithm
🚀 Parallel Algorithms Benchmark Suite

A research-grade C++ project exploring sequential vs parallel performance across core algorithmic domains

This project is a comprehensive study of parallel algorithm design and performance using modern C++ (C++20) and OpenMP. It benchmarks graph traversal, matrix multiplication, and sorting algorithms, highlighting how algorithmic structure, memory access, and parallelization strategies affect real-world performance.


🧠 Motivation

Modern CPUs derive performance not from higher clock speeds, but from parallelism:

  • Multi-core execution
  • Cache hierarchies
  • Memory locality

This project was built to:

  • Understand why parallel code speeds up (or doesn’t)
  • Compare sequential vs parallel implementations fairly
  • Apply theory → systems-level reality

🧩 Project Structure

parallel_algorithm/
│
├── src/
│   ├── graph/
│   │   ├── graph.h
│   │   └── bfs.{h,cpp}
│   │
│   ├── matrix/
│   │   ├── matmul.h
│   │   ├── matmul_seq.cpp
│   │   └── matmul_parallel.cpp
│   │
│   ├── sort/
│   │   ├── mergesort.h
│   │   ├── mergesort_seq.cpp
│   │   └── mergesort_parallel.cpp
│   │
│   ├── utils/
│   │   ├── timer.h
│   │   └── graph_utils.h
│   │
│   └── main.cpp
│
├── CMakeLists.txt
└── README.md

⚙️ Technologies Used

  • Language: C++20
  • Parallelism: OpenMP
  • Build System: CMake
  • Compiler: GCC / MinGW (OpenMP enabled)
  • Platform: Linux / Windows

🧪 Benchmarked Algorithms

1️⃣ Graph Algorithms — BFS

Goal: Measure the impact of parallelism on irregular workloads.

  • Sequential BFS
  • Parallel BFS using frontier-based expansion

🔍 Key Insight:

BFS is often memory-bound, so speedup depends heavily on graph structure and cache behavior.
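The frontier-based expansion can be sketched as below. This is a minimal illustration under stated assumptions, not the repository's `bfs.cpp`: the adjacency-list representation (`vector<vector<int>>`) and the function name `parallel_bfs` are assumptions. Each level's frontier is expanded in parallel, and an atomic visited flag ensures every vertex is claimed by exactly one thread:

```cpp
#include <atomic>
#include <vector>

// Frontier-based parallel BFS sketch. Returns distance from `source`
// (-1 for unreachable vertices). Graph is an adjacency list.
std::vector<int> parallel_bfs(const std::vector<std::vector<int>>& adj, int source) {
    const int n = static_cast<int>(adj.size());
    std::vector<int> dist(n, -1);
    std::vector<std::atomic<char>> visited(n);
    for (auto& v : visited) v.store(0);

    std::vector<int> frontier{source};
    visited[source] = 1;
    dist[source] = 0;
    int level = 0;

    while (!frontier.empty()) {
        std::vector<int> next;
        #pragma omp parallel
        {
            std::vector<int> local;  // thread-local buffer avoids contention on `next`
            #pragma omp for nowait
            for (int i = 0; i < static_cast<int>(frontier.size()); ++i) {
                for (int v : adj[frontier[i]]) {
                    char expected = 0;
                    // CAS claims the vertex exactly once across threads
                    if (visited[v].compare_exchange_strong(expected, 1)) {
                        dist[v] = level + 1;
                        local.push_back(v);
                    }
                }
            }
            #pragma omp critical
            next.insert(next.end(), local.begin(), local.end());
        }
        frontier = std::move(next);
        ++level;
    }
    return dist;
}
```

The thread-local `local` buffers plus a single critical-section merge per thread keep synchronization off the hot inner loop, which matters precisely because BFS is memory-bound.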


2️⃣ Matrix Multiplication (N × N)

Implementations

  • Sequential Naïve (O(N³))
  • Blocked (Cache-Aware) Parallel MatMul

Optimizations Applied

  • Loop blocking for L1/L2 cache locality
  • #pragma omp parallel for collapse(2)
  • Avoided false sharing by careful loop ordering

🔍 Why Blocking Matters:

Naïve matrix multiplication causes massive cache misses. Blocking ensures that submatrices stay hot in cache, resulting in:

  • 10–100× speedup vs naïve sequential
  • Near-linear scaling up to core count
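A blocked parallel multiply along these lines can be sketched as follows. This is an illustrative version, not the repository's `matmul_parallel.cpp`: the row-major flat layout, the function name, and the default block size of 64 are assumptions (the right block size depends on the target's L1/L2 cache):

```cpp
#include <algorithm>
#include <vector>

// Cache-blocked parallel matrix multiply sketch (row-major, C += A * B).
// `collapse(2)` flattens the two outer block loops into one parallel
// iteration space; each thread owns disjoint rows of C, avoiding false sharing.
void matmul_blocked(const std::vector<double>& A, const std::vector<double>& B,
                    std::vector<double>& C, int n, int blk = 64) {
    #pragma omp parallel for collapse(2)
    for (int ii = 0; ii < n; ii += blk)
        for (int jj = 0; jj < n; jj += blk)
            for (int kk = 0; kk < n; kk += blk)
                // multiply one blk x blk submatrix; operands stay hot in cache
                for (int i = ii; i < std::min(ii + blk, n); ++i)
                    for (int k = kk; k < std::min(kk + blk, n); ++k) {
                        const double a = A[i * n + k];
                        for (int j = jj; j < std::min(jj + blk, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

Hoisting `A[i*n + k]` out of the innermost loop and iterating `j` contiguously keeps the inner loop streaming through `B` and `C` with unit stride, which is what makes blocking pay off.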

3️⃣ Sorting — Merge Sort

Implementations

  • Sequential Merge Sort
  • Parallel Merge Sort using OpenMP sections

Design Choices

  • Depth-limited parallel recursion
  • Parallel work only at higher recursion levels
  • Sequential fallback to reduce overhead

🔍 Key Insight:

Uncontrolled parallel recursion degrades performance through thread oversubscription.
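The depth-limited design can be sketched like this. It is a minimal version under assumptions (function name and the `max_depth` cutoff value are illustrative), not the repository's `mergesort_parallel.cpp`; it mirrors the choices above: OpenMP sections at shallow recursion levels, sequential fallback below the cutoff:

```cpp
#include <algorithm>
#include <vector>

// Depth-limited parallel merge sort sketch: spawn OpenMP sections only while
// depth < max_depth, then fall back to sequential recursion so thread-creation
// overhead and oversubscription don't swamp the gains.
void merge_sort(std::vector<int>& a, int lo, int hi, int depth, int max_depth) {
    if (hi - lo < 2) return;
    const int mid = lo + (hi - lo) / 2;
    if (depth < max_depth) {
        #pragma omp parallel sections
        {
            #pragma omp section
            merge_sort(a, lo, mid, depth + 1, max_depth);
            #pragma omp section
            merge_sort(a, mid, hi, depth + 1, max_depth);
        }
    } else {
        // sequential fallback: below the cutoff, recursion is cheaper serial
        merge_sort(a, lo, mid, depth + 1, max_depth);
        merge_sort(a, mid, hi, depth + 1, max_depth);
    }
    std::inplace_merge(a.begin() + lo, a.begin() + mid, a.begin() + hi);
}
```

Note that nested `parallel sections` regions serialize unless nested parallelism is enabled, which is one reason `omp task` (listed under Future Enhancements) is usually the better fit for recursive decomposition.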


⏱ Benchmarking Methodology

  • High-resolution wall-clock timer (std::chrono)
  • Fixed random seeds for reproducibility
  • Identical input data across runs
  • Multiple thread counts tested:
Threads: 1, 2, 4, 8, 12
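A timer in the spirit of the methodology above might look like this (an illustrative sketch, not the repository's `timer.h`; `std::chrono::steady_clock` is assumed here because it is monotonic, which is what you want for interval timing):

```cpp
#include <chrono>

// Minimal wall-clock timer: starts on construction, reports elapsed
// milliseconds as a double via std::chrono.
class Timer {
    std::chrono::steady_clock::time_point start_ = std::chrono::steady_clock::now();
public:
    double elapsed_ms() const {
        using namespace std::chrono;
        return duration<double, std::milli>(steady_clock::now() - start_).count();
    }
};
```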

📈 Sample Output

Parallel Matrix Multiply:

Threads: 1  | Time: 602112 ms | Speedup: 1.00
Threads: 2  | Time: 315420 ms | Speedup: 1.92
Threads: 4  | Time: 162831 ms | Speedup: 3.73
Threads: 8  | Time:  89120 ms | Speedup: 6.82
Parallel BFS traversal:

Thread: 1 | Time: 34.9509 ms
Thread: 2 | Time: 19.7291 ms
Thread: 4 | Time: 12.3914 ms
Thread: 8 | Time: 7.9011 ms
Thread: 12 | Time: 6.9935 ms
Parallel Merge Sort:

Threads: 1 | Time: 682.306 ms | Speedup: 1.26748
Threads: 2 | Time: 932.408 ms | Speedup: 0.9275
Threads: 4 | Time: 601.601 ms | Speedup: 1.43751
Threads: 8 | Time: 621.919 ms | Speedup: 1.39055

🧠 Key Learnings

  • Parallelism ≠ automatic speedup
  • Memory locality is as important as thread count
  • Over-parallelization can degrade performance
  • OpenMP is powerful but must be controlled

🚧 Known Limitations

  • No SIMD intrinsics (AVX) yet
  • NUMA effects not explicitly handled
  • Sorting performance sensitive to input size

🛠 How to Build & Run

mkdir build
cd build
cmake ..
cmake --build .
./parallel_algorithm

Ensure OpenMP support is enabled in your compiler.
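With CMake, OpenMP can be enabled portably via `find_package` rather than hand-written compiler flags. A minimal fragment (the target name `parallel_algorithm` is assumed from the run command above; the actual `CMakeLists.txt` may differ):

```cmake
find_package(OpenMP REQUIRED)
target_link_libraries(parallel_algorithm PRIVATE OpenMP::OpenMP_CXX)
```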


📌 Future Enhancements

  • SIMD-optimized matrix multiplication (AVX2/AVX-512)
  • Task-based parallelism (omp task)
  • NUMA-aware allocation
  • CSV benchmark export + plots

🎓 Academic & Resume Value

This project demonstrates:

  • Parallel algorithm design
  • Systems-level performance thinking
  • Cache-aware programming
  • OpenMP mastery

📌 Ideal for:

  • Systems / HPC interviews
  • Research internships
  • Performance engineering roles

📜 License

MIT License — free to use, modify, and distribute.


"Fast code is not written — it is engineered." 🔥
