This repository implements custom CUDA kernels for common ML operations, benchmarked against PyTorch's highly optimized cuBLAS/cuDNN kernels. The goal is to:
- Understand GPU parallelization patterns
- Compare naive kernel performance vs. library implementations
- Build intuition for ML Systems performance engineering
```
├── kernels/
│   ├── matrix_multiply.cu
│   ├── vector_add.cu
│   ├── relu.cu
│   ├── dot_product.cu
│   └── intro.cu
├── benchmarks/
│   └── benchmark.py
└── .gitignore
```

- `kernels/`: CUDA C++ kernel implementations
- `benchmarks/`: Python script to benchmark kernels vs. PyTorch
| Kernel | Description |
|---|---|
| matrix_multiply | Matrix multiplication (1024x1024) |
| vector_add | Elementwise vector addition |
| relu | ReLU activation function |
| dot_product | Vector dot product reduction |
| intro | 4x4 matrix multiplication demo |
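The actual implementations live in `kernels/`; as an illustration of the pattern these kernels follow, here is a minimal sketch of a naive elementwise kernel in the style of `vector_add.cu` (the kernel and variable names here are illustrative, not the repo's exact code):

```cuda
#include <cstdio>

// One thread per element: each thread computes a single output value.
__global__ void vector_add(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] + b[i];   // guard against out-of-range threads
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *out;
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&out, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // round up to cover all n elements
    vector_add<<<blocks, threads>>>(a, b, out, n);
    cudaDeviceSynchronize();

    printf("out[0] = %f\n", out[0]);  // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(out);
    return 0;
}
```

The grid-size calculation `(n + threads - 1) / threads` plus the `i < n` guard is the standard idiom for mapping a 1D array onto a 1D grid of thread blocks.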
1. Ensure the NVIDIA CUDA Toolkit is installed.
2. Compile each `.cu` file:

```
cd kernels
nvcc -o matrix_multiply.exe matrix_multiply.cu
nvcc -o vector_add.exe vector_add.cu
nvcc -o relu.exe relu.cu
nvcc -o dot_product.exe dot_product.cu
nvcc -o intro.exe intro.cu
```

Replace `.exe` with no extension if on Linux/Mac.
From the repo root:

```
cd benchmarks
python benchmark.py
```

| Kernel | PyTorch Time (ms) | Custom CUDA Time (ms) | Speedup |
|---|---|---|---|
| matrix_multiply | 14.28 | 6.95 | 2.06x |
| vector_add | 2.39 | 1.35 | 1.77x |
| relu | 2.97 | 0.56 | 5.35x |
| dot_product | ~0 | 1.33 | Slower |
| intro (4x4 matmul) | 2.25 | 1.08 | 2.09x |
- Matrix multiplication and ReLU kernels show significant speedups, demonstrating effective GPU thread parallelization.
- Vector addition gains are more modest: elementwise addition is memory-bandwidth bound, and PyTorch's kernels already run close to the hardware's peak bandwidth.
- Dot product is slower due to naive reduction implementation vs. PyTorch's warp-level optimized reductions.
- For the small 4x4 matmul (intro), kernel launch and framework overhead dominate the runtime, so the lighter-weight custom kernel comes out ahead.
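The warp-level reduction named in the future-work list below could look something like this sketch, which uses CUDA's `__shfl_down_sync` shuffle intrinsic (CUDA 9+) to reduce within a warp without shared-memory round trips (names and block layout here are illustrative assumptions, not the repo's code):

```cuda
// Reduce a value across the 32 lanes of a warp using register shuffles.
__inline__ __device__ float warp_reduce_sum(float val) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;  // lane 0 holds the warp's sum
}

__global__ void dot_product(const float* a, const float* b, float* result, int n) {
    __shared__ float warp_sums[32];  // one slot per warp (max 32 warps/block)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = (i < n) ? a[i] * b[i] : 0.0f;

    sum = warp_reduce_sum(sum);               // step 1: reduce within each warp
    int lane = threadIdx.x % warpSize;
    int warp = threadIdx.x / warpSize;
    if (lane == 0) warp_sums[warp] = sum;
    __syncthreads();

    // Step 2: the first warp reduces the per-warp partial sums.
    if (warp == 0) {
        int num_warps = (blockDim.x + warpSize - 1) / warpSize;
        sum = (lane < num_warps) ? warp_sums[lane] : 0.0f;
        sum = warp_reduce_sum(sum);
        if (lane == 0) atomicAdd(result, sum);  // step 3: combine across blocks
    }
}
```

This replaces a naive shared-memory tree reduction (one `__syncthreads()` per level) with register-to-register shuffles, which is essentially the pattern highly tuned library reductions use.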
- Implement warp-level reductions for dot product
- Integrate unit tests comparing kernel outputs with PyTorch for correctness validation
- Extend to batched kernels relevant for end-to-end ML pipeline acceleration
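For the correctness-testing item above, a self-contained sketch of the idea is to validate a kernel against a CPU reference on the host side (a full test would compare against PyTorch from `benchmark.py`; the ReLU example and tolerance below are illustrative):

```cuda
#include <cmath>
#include <cstdio>

__global__ void relu_kernel(const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = fmaxf(x[i], 0.0f);
}

int main() {
    const int n = 1024;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = (i % 2 ? 1.0f : -1.0f) * i;  // mixed signs

    relu_kernel<<<(n + 255) / 256, 256>>>(x, y, n);
    cudaDeviceSynchronize();

    // Compare every element against a straightforward CPU reference.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        float expected = x[i] > 0.0f ? x[i] : 0.0f;
        if (std::fabs(y[i] - expected) > 1e-6f) ++mismatches;
    }
    printf("%s (%d mismatches)\n", mismatches == 0 ? "PASS" : "FAIL", mismatches);

    cudaFree(x); cudaFree(y);
    return 0;
}
```

The same shape extends naturally to the other kernels: run the custom kernel, compute a trusted reference (CPU loop or `torch` op), and assert elementwise agreement within a floating-point tolerance.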
Gauri Sharan
MIT License