A minimal CUDA implementation of softmax with multiple kernel variants for educational purposes.
This project implements and compares five softmax kernels in CUDA, ranging from a naive baseline to progressively optimized variants. It includes a Python reference implementation for validation and testing.
| Kernel | Description | Avg Time (ms) | Throughput (elements/ms) | Speedup vs Kernel 0 |
|---|---|---|---|---|
| 0 | Basic Softmax | 0.29562 | 14.2M | 1.0x (baseline) |
| 1 | Shared Memory Optimized | 0.09587 | 43.8M | 3.1x |
| 2 | Warp Primitives | 0.03016 | 139.1M | 9.8x |
| 3 | 2D Block with Warp Primitives | 0.01770 | 236.9M | 16.7x |
| 4 | Multi-row per Warp | 0.01669 | 251.2M | 17.7x |
Key Observations:
- Significant optimization impact: Kernel 4 is 17.7x faster than the baseline Kernel 0
- Warp-level optimizations: Kernels 2-4 show dramatic improvements by leveraging warp shuffle instructions
- Memory hierarchy utilization: Kernel 1 improves over baseline by using shared memory
- Best performance: Kernel 4 achieves the highest throughput (251M elements/ms) with multi-row per warp processing
Test Configuration:
- GPU: NVIDIA RTX 4500 Ada (24GB VRAM)
- CUDA Version: 12.8
- Batch Size: 32768
- Dimension: 128
- Repetitions: 100
Requirements:
- CUDA Toolkit (tested with CUDA 12.8)
- CMake 3.10+
- Python 3.6+ with NumPy
- GCC/G++ with C++11 support
Clone the repository:
git clone https://github.com/ggluo/Minimal_Softmax.git
cd Minimal_Softmax
Run the complete test suite:
./run.sh
This will:
- Generate test data using Python
- Build all CUDA kernels
- Test each kernel (0-4)
- Compare CUDA outputs with Python reference
- Report performance metrics
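Every kernel is checked against the same numerically stable formulation: subtract the row max before exponentiating, then normalize by the sum of exponentials. Below is a minimal host-side sketch of that reference and an allclose-style check; the names `softmax_ref` and `allclose` are hypothetical stand-ins for what `softmax.py` and `compare.py` do, not the repo's actual API.

```cuda
#include <cmath>

// Numerically stable softmax over each row of a rows x dim matrix.
void softmax_ref(const float* x, float* y, int rows, int dim) {
    for (int r = 0; r < rows; ++r) {
        const float* xr = x + r * dim;
        float* yr = y + r * dim;
        float m = xr[0];
        for (int i = 1; i < dim; ++i) m = std::fmax(m, xr[i]);   // row max
        float s = 0.0f;
        for (int i = 0; i < dim; ++i) s += std::exp(xr[i] - m);  // stable sum
        for (int i = 0; i < dim; ++i) yr[i] = std::exp(xr[i] - m) / s;
    }
}

// Element-wise comparison with an absolute tolerance, mirroring np.allclose.
bool allclose(const float* a, const float* b, int n, float tol = 1e-5f) {
    for (int i = 0; i < n; ++i)
        if (std::fabs(a[i] - b[i]) > tol) return false;
    return true;
}
```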
Kernel 0: Basic Softmax
- Simple grid-stride loop
- Each thread processes multiple elements
- No shared memory or warp optimizations
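A minimal sketch of this pattern, assuming a row-major `rows x dim` input; it is illustrative, not the repo's exact kernel.

```cuda
// Naive softmax: a grid-stride loop over rows, each thread serially
// handling every element of the rows it owns.
__global__ void softmax_naive(const float* in, float* out, int rows, int dim) {
    for (int row = blockIdx.x * blockDim.x + threadIdx.x; row < rows;
         row += gridDim.x * blockDim.x) {
        const float* x = in + row * dim;
        float* y = out + row * dim;
        float m = -INFINITY;                          // row max for stability
        for (int i = 0; i < dim; ++i) m = fmaxf(m, x[i]);
        float s = 0.0f;
        for (int i = 0; i < dim; ++i) s += expf(x[i] - m);
        for (int i = 0; i < dim; ++i) y[i] = expf(x[i] - m) / s;
    }
}
```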
Kernel 1: Shared Memory Optimized
- Uses dynamic shared memory for reduction
- Each block processes one row
- Better memory coalescing
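A sketch of the one-block-per-row shape with a tree reduction in dynamic shared memory. It assumes `blockDim.x` is a power of two and a launch of the form `kernel<<<rows, threads, threads * sizeof(float)>>>(...)`; again illustrative rather than the repo's exact code.

```cuda
__global__ void softmax_smem(const float* in, float* out, int dim) {
    extern __shared__ float sdata[];
    const float* x = in + blockIdx.x * dim;   // one row per block
    float* y = out + blockIdx.x * dim;
    int tid = threadIdx.x;

    // Thread-local max over a strided slice, then tree-reduce in shared memory.
    float m = -INFINITY;
    for (int i = tid; i < dim; i += blockDim.x) m = fmaxf(m, x[i]);
    sdata[tid] = m;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] = fmaxf(sdata[tid], sdata[tid + s]);
        __syncthreads();
    }
    m = sdata[0];
    __syncthreads();

    // Same pattern for the sum of exponentials.
    float sum = 0.0f;
    for (int i = tid; i < dim; i += blockDim.x) sum += expf(x[i] - m);
    sdata[tid] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    sum = sdata[0];

    for (int i = tid; i < dim; i += blockDim.x) y[i] = expf(x[i] - m) / sum;
}
```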
Kernel 2: Warp Primitives
- Leverages warp shuffle instructions
- Each warp processes one row
- Efficient warp-level reductions
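A sketch of the warp-per-row pattern built on `__shfl_xor_sync`: a butterfly reduction that needs no shared memory and leaves the result in every lane.

```cuda
__device__ float warp_max(float v) {
    for (int off = 16; off > 0; off >>= 1)
        v = fmaxf(v, __shfl_xor_sync(0xffffffff, v, off));
    return v;
}
__device__ float warp_sum(float v) {
    for (int off = 16; off > 0; off >>= 1)
        v += __shfl_xor_sync(0xffffffff, v, off);
    return v;
}

// One 32-thread warp per row; each lane strides over the row.
__global__ void softmax_warp(const float* in, float* out, int rows, int dim) {
    int warp = (blockIdx.x * blockDim.x + threadIdx.x) / 32;  // global warp id
    int lane = threadIdx.x % 32;
    if (warp >= rows) return;
    const float* x = in + warp * dim;
    float* y = out + warp * dim;

    float m = -INFINITY;
    for (int i = lane; i < dim; i += 32) m = fmaxf(m, x[i]);
    m = warp_max(m);                       // all lanes now hold the row max
    float s = 0.0f;
    for (int i = lane; i < dim; i += 32) s += expf(x[i] - m);
    s = warp_sum(s);
    for (int i = lane; i < dim; i += 32) y[i] = expf(x[i] - m) / s;
}
```

With 128-thread blocks this gives four rows per block, so a grid of `(rows * 32 + 127) / 128` blocks covers the batch.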
Kernel 3: 2D Block with Warp Primitives
- 2D block layout (32x4 threads)
- Each warp processes one row
- Better occupancy with multiple warps per block
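A sketch of the 2D launch, reusing `warp_max`/`warp_sum` from the previous sketch; the row index comes directly from the block layout instead of a division.

```cuda
// 32x4 layout: threadIdx.x is the lane, threadIdx.y picks the warp
// (and therefore the row) within the block.
__global__ void softmax_warp2d(const float* in, float* out, int rows, int dim) {
    int row  = blockIdx.x * blockDim.y + threadIdx.y;
    int lane = threadIdx.x;                  // 0..31
    if (row >= rows) return;
    const float* x = in + row * dim;
    float* y = out + row * dim;

    float m = -INFINITY;
    for (int i = lane; i < dim; i += 32) m = fmaxf(m, x[i]);
    m = warp_max(m);
    float s = 0.0f;
    for (int i = lane; i < dim; i += 32) s += expf(x[i] - m);
    s = warp_sum(s);
    for (int i = lane; i < dim; i += 32) y[i] = expf(x[i] - m) / s;
}

// Possible launch:
//   dim3 block(32, 4);
//   softmax_warp2d<<<(rows + 3) / 4, block>>>(in, out, rows, dim);
```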
Kernel 4: Multi-row per Warp
- Template-based implementation
- Each warp processes multiple rows (ROWS_PER_WARP=4)
- Register caching for better performance
- Assumes the input dimension is at most 512
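A sketch of the templated multi-row idea, again reusing `warp_max`/`warp_sum` from above. The per-lane `cache` array (at most `MAX_DIM / 32` floats) is what keeps each row in registers, which is why the dimension must be bounded.

```cuda
// Each warp walks ROWS_PER_WARP consecutive rows, loading its strided slice
// of a row into registers once and reusing it for the max, the sum, and the
// final write.
template <int ROWS_PER_WARP, int MAX_DIM>
__global__ void softmax_multirow(const float* in, float* out, int rows, int dim) {
    int warp = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane = threadIdx.x % 32;
    float cache[MAX_DIM / 32];               // register cache; needs dim <= MAX_DIM

    int row_end = min(warp * ROWS_PER_WARP + ROWS_PER_WARP, rows);
    for (int r = warp * ROWS_PER_WARP; r < row_end; ++r) {
        const float* x = in + r * dim;
        float m = -INFINITY;
        for (int i = lane, j = 0; i < dim; i += 32, ++j) {
            cache[j] = x[i];                 // one global load per element
            m = fmaxf(m, cache[j]);
        }
        m = warp_max(m);
        float s = 0.0f;
        for (int i = lane, j = 0; i < dim; i += 32, ++j) {
            cache[j] = expf(cache[j] - m);   // overwrite with exponentials
            s += cache[j];
        }
        s = warp_sum(s);
        for (int i = lane, j = 0; i < dim; i += 32, ++j)
            out[r * dim + i] = cache[j] / s;
    }
}

// e.g., softmax_multirow<4, 512><<<grid, block>>>(in, out, rows, dim);
```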
Project Structure:
Minimal_Softmax/
├── src/
│ ├── kernel/
│ │ ├── common.cuh # Warp reduction utilities
│ │ ├── kernel_0.cuh # Basic softmax kernel
│ │ ├── kernel_1.cuh # Shared memory optimized kernel
│ │ ├── kernel_2.cuh # Warp primitive kernel
│ │ ├── kernel_3.cuh # 2D block kernel
│ │ └── kernel_4.cuh # Multi-row per warp kernel
│ ├── kernel.cuh # Kernel declarations
│ ├── utils.cuh # Utility function declarations
│ └── utils.cu # Utility implementations
├── softmax.cu # Main CUDA test program
├── softmax.py # Python reference implementation
├── compare.py # File comparison utility
├── run.sh # Automated test script
└── CMakeLists.txt # Build configuration
This project is intended for educational purposes. Feel free to use and modify the code as needed.