A minimal CUDA implementation of softmax with multiple kernel variants for educational purposes.
This project implements and compares five softmax kernels in CUDA, ranging from a naive baseline to progressively optimized variants. It includes a Python reference implementation for validation and testing.
| Kernel | Description | Avg Time (ms) | Throughput (elements/ms) | Speedup vs Kernel 0 |
|---|---|---|---|---|
| 0 | Basic Softmax | 0.29562 | 14.2M | 1.0x (baseline) |
| 1 | Shared Memory Optimized | 0.09587 | 43.8M | 3.1x |
| 2 | Warp Primitives | 0.03016 | 139.1M | 9.8x |
| 3 | 2D Block with Warp Primitives | 0.01770 | 236.9M | 16.7x |
| 4 | Multi-row per Warp | 0.01669 | 251.2M | 17.7x |
Key Observations:
- Significant optimization impact: Kernel 4 is 17.7x faster than the baseline Kernel 0
- Warp-level optimizations: Kernels 2-4 show dramatic improvements by leveraging warp shuffle instructions
- Memory hierarchy utilization: Kernel 1 improves over baseline by using shared memory
- Best performance: Kernel 4 achieves the highest throughput (251M elements/ms) with multi-row per warp processing
Test Configuration:
- GPU: NVIDIA RTX 4500 Ada (24GB VRAM)
- CUDA Version: 12.8
- Batch Size: 32768
- Dimension: 128
- Repetitions: 100
Requirements:
- CUDA Toolkit (tested with CUDA 12.8)
- CMake 3.10+
- Python 3.6+ with NumPy
- GCC/G++ with C++11 support
Clone the repository:
git clone https://github.com/ggluo/Minimal_Softmax.git
cd Minimal_Softmax
Run the complete test suite:
./run.sh
This will:
- Generate test data using Python
- Build all CUDA kernels
- Test each kernel (0-4)
- Compare CUDA outputs with Python reference
- Report performance metrics
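Every kernel is checked against the same numerically stable formulation: subtract the row max before exponentiating, then normalize by the sum of exponentials. Below is a minimal host-side sketch of that reference and an allclose-style check; the names `softmax_ref` and `allclose` are hypothetical stand-ins for what `softmax.py` and `compare.py` do, not the repo's actual API.

```cuda
#include <cmath>

// Numerically stable softmax over each row of a rows x dim matrix.
void softmax_ref(const float* x, float* y, int rows, int dim) {
    for (int r = 0; r < rows; ++r) {
        const float* xr = x + r * dim;
        float* yr = y + r * dim;
        float m = xr[0];
        for (int i = 1; i < dim; ++i) m = std::fmax(m, xr[i]);   // row max
        float s = 0.0f;
        for (int i = 0; i < dim; ++i) s += std::exp(xr[i] - m);  // stable sum
        for (int i = 0; i < dim; ++i) yr[i] = std::exp(xr[i] - m) / s;
    }
}

// Element-wise comparison with an absolute tolerance, mirroring np.allclose.
bool allclose(const float* a, const float* b, int n, float tol = 1e-5f) {
    for (int i = 0; i < n; ++i)
        if (std::fabs(a[i] - b[i]) > tol) return false;
    return true;
}
```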
Kernel 0: Basic Softmax
- Simple grid-stride loop
- Each thread processes multiple elements
- No shared memory or warp optimizations
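A minimal sketch of this pattern, assuming a row-major `rows x dim` input; it is illustrative, not the repo's exact kernel.

```cuda
// Naive softmax: a grid-stride loop over rows, each thread serially
// handling every element of the rows it owns.
__global__ void softmax_naive(const float* in, float* out, int rows, int dim) {
    for (int row = blockIdx.x * blockDim.x + threadIdx.x; row < rows;
         row += gridDim.x * blockDim.x) {
        const float* x = in + row * dim;
        float* y = out + row * dim;
        float m = -INFINITY;                          // row max for stability
        for (int i = 0; i < dim; ++i) m = fmaxf(m, x[i]);
        float s = 0.0f;
        for (int i = 0; i < dim; ++i) s += expf(x[i] - m);
        for (int i = 0; i < dim; ++i) y[i] = expf(x[i] - m) / s;
    }
}
```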
Kernel 1: Shared Memory Optimized
- Uses dynamic shared memory for reduction
- Each block processes one row
- Better memory coalescing
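A sketch of the one-block-per-row shape with a tree reduction in dynamic shared memory. It assumes `blockDim.x` is a power of two and a launch of the form `kernel<<<rows, threads, threads * sizeof(float)>>>(...)`; again illustrative rather than the repo's exact code.

```cuda
__global__ void softmax_smem(const float* in, float* out, int dim) {
    extern __shared__ float sdata[];
    const float* x = in + blockIdx.x * dim;   // one row per block
    float* y = out + blockIdx.x * dim;
    int tid = threadIdx.x;

    // Thread-local max over a strided slice, then tree-reduce in shared memory.
    float m = -INFINITY;
    for (int i = tid; i < dim; i += blockDim.x) m = fmaxf(m, x[i]);
    sdata[tid] = m;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] = fmaxf(sdata[tid], sdata[tid + s]);
        __syncthreads();
    }
    m = sdata[0];
    __syncthreads();

    // Same pattern for the sum of exponentials.
    float sum = 0.0f;
    for (int i = tid; i < dim; i += blockDim.x) sum += expf(x[i] - m);
    sdata[tid] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    sum = sdata[0];

    for (int i = tid; i < dim; i += blockDim.x) y[i] = expf(x[i] - m) / sum;
}
```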
Kernel 2: Warp Primitives
- Leverages warp shuffle instructions
- Each warp processes one row
- Efficient warp-level reductions
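A sketch of the warp-per-row pattern built on `__shfl_xor_sync`: a butterfly reduction that needs no shared memory and leaves the result in every lane.

```cuda
__device__ float warp_max(float v) {
    for (int off = 16; off > 0; off >>= 1)
        v = fmaxf(v, __shfl_xor_sync(0xffffffff, v, off));
    return v;
}
__device__ float warp_sum(float v) {
    for (int off = 16; off > 0; off >>= 1)
        v += __shfl_xor_sync(0xffffffff, v, off);
    return v;
}

// One 32-thread warp per row; each lane strides over the row.
__global__ void softmax_warp(const float* in, float* out, int rows, int dim) {
    int warp = (blockIdx.x * blockDim.x + threadIdx.x) / 32;  // global warp id
    int lane = threadIdx.x % 32;
    if (warp >= rows) return;
    const float* x = in + warp * dim;
    float* y = out + warp * dim;

    float m = -INFINITY;
    for (int i = lane; i < dim; i += 32) m = fmaxf(m, x[i]);
    m = warp_max(m);                       // all lanes now hold the row max
    float s = 0.0f;
    for (int i = lane; i < dim; i += 32) s += expf(x[i] - m);
    s = warp_sum(s);
    for (int i = lane; i < dim; i += 32) y[i] = expf(x[i] - m) / s;
}
```

With 128-thread blocks this gives four rows per block, so a grid of `(rows * 32 + 127) / 128` blocks covers the batch.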
Kernel 3: 2D Block with Warp Primitives
- 2D block layout (32x4 threads)
- Each warp processes one row
- Better occupancy with multiple warps per block
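A sketch of the 2D launch, reusing `warp_max`/`warp_sum` from the previous sketch; the row index comes directly from the block layout instead of a division.

```cuda
// 32x4 layout: threadIdx.x is the lane, threadIdx.y picks the warp
// (and therefore the row) within the block.
__global__ void softmax_warp2d(const float* in, float* out, int rows, int dim) {
    int row  = blockIdx.x * blockDim.y + threadIdx.y;
    int lane = threadIdx.x;                  // 0..31
    if (row >= rows) return;
    const float* x = in + row * dim;
    float* y = out + row * dim;

    float m = -INFINITY;
    for (int i = lane; i < dim; i += 32) m = fmaxf(m, x[i]);
    m = warp_max(m);
    float s = 0.0f;
    for (int i = lane; i < dim; i += 32) s += expf(x[i] - m);
    s = warp_sum(s);
    for (int i = lane; i < dim; i += 32) y[i] = expf(x[i] - m) / s;
}

// Possible launch:
//   dim3 block(32, 4);
//   softmax_warp2d<<<(rows + 3) / 4, block>>>(in, out, rows, dim);
```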
Kernel 4: Multi-row per Warp
- Template-based implementation
- Each warp processes multiple rows (ROWS_PER_WARP=4)
- Register caching for better performance
- Assumes the input dimension is at most 512
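A sketch of the templated multi-row idea, again reusing `warp_max`/`warp_sum` from above. The per-lane `cache` array (at most `MAX_DIM / 32` floats) is what keeps each row in registers, which is why the dimension must be bounded.

```cuda
// Each warp walks ROWS_PER_WARP consecutive rows, loading its strided slice
// of a row into registers once and reusing it for the max, the sum, and the
// final write.
template <int ROWS_PER_WARP, int MAX_DIM>
__global__ void softmax_multirow(const float* in, float* out, int rows, int dim) {
    int warp = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane = threadIdx.x % 32;
    float cache[MAX_DIM / 32];               // register cache; needs dim <= MAX_DIM

    int row_end = min(warp * ROWS_PER_WARP + ROWS_PER_WARP, rows);
    for (int r = warp * ROWS_PER_WARP; r < row_end; ++r) {
        const float* x = in + r * dim;
        float m = -INFINITY;
        for (int i = lane, j = 0; i < dim; i += 32, ++j) {
            cache[j] = x[i];                 // one global load per element
            m = fmaxf(m, cache[j]);
        }
        m = warp_max(m);
        float s = 0.0f;
        for (int i = lane, j = 0; i < dim; i += 32, ++j) {
            cache[j] = expf(cache[j] - m);   // overwrite with exponentials
            s += cache[j];
        }
        s = warp_sum(s);
        for (int i = lane, j = 0; i < dim; i += 32, ++j)
            out[r * dim + i] = cache[j] / s;
    }
}

// e.g., softmax_multirow<4, 512><<<grid, block>>>(in, out, rows, dim);
```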
Project Structure:
Minimal_Softmax/
├── src/
│ ├── kernel/
│ │ ├── common.cuh # Warp reduction utilities
│ │ ├── kernel_0.cuh # Basic softmax kernel
│ │ ├── kernel_1.cuh # Shared memory optimized kernel
│ │ ├── kernel_2.cuh # Warp primitive kernel
│ │ ├── kernel_3.cuh # 2D block kernel
│ │ └── kernel_4.cuh # Multi-row per warp kernel
│ ├── kernel.cuh # Kernel declarations
│ ├── utils.cuh # Utility function declarations
│ └── utils.cu # Utility implementations
├── softmax.cu # Main CUDA test program
├── softmax.py # Python reference implementation
├── compare.py # File comparison utility
├── run.sh # Automated test script
└── CMakeLists.txt # Build configuration
This project is intended for educational purposes. Feel free to use and modify the code as needed.