GPU kernel optimization: Softmax
The following code was tested using the Docker image
nvidia/cuda:12.4.0-devel-ubuntu22.04 on a GeForce RTX 2070.
- Build the Python library with CUDA bindings

      cd cuda
      pip install .

- Test both implementations against the PyTorch baseline

      python3 assertions.py

- Run a benchmark

      python3 benchmark.py

- Profile both implementations

      ncu --set full [-o output_path] python3 -O assertions.py

- Part 3: Online Softmax: WIP
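
Since the online softmax part is still marked WIP, the recurrence it usually refers to can be sketched in plain Python. This is only an illustration of the general single-pass technique (a running maximum plus a rescaled running sum), not this repository's kernel. The function names `softmax_reference` and `softmax_online` are hypothetical, and the comparison at the end only mirrors the idea of checking an implementation against a baseline; it does not reproduce the contents of `assertions.py`.

```python
import math


def softmax_reference(row):
    """Numerically stable softmax using separate passes (hypothetical baseline)."""
    row_max = max(row)                              # pass 1: maximum
    exps = [math.exp(x - row_max) for x in row]     # pass 2: shifted exponentials
    total = sum(exps)
    return [e / total for e in exps]                # pass 3: normalization


def softmax_online(row):
    """Online softmax sketch: maximum and sum are accumulated in one pass."""
    running_max = float("-inf")
    running_sum = 0.0
    for x in row:
        new_max = max(running_max, x)
        # Rescale the sum accumulated so far to the new maximum,
        # then add the contribution of the current element.
        running_sum = running_sum * math.exp(running_max - new_max) + math.exp(x - new_max)
        running_max = new_max
    # A second pass is still needed to write the normalized outputs.
    return [math.exp(x - running_max) / running_sum for x in row]


if __name__ == "__main__":
    sample = [1.0, 2.0, 3.0, 1000.0, -5.0]
    expected = softmax_reference(sample)
    actual = softmax_online(sample)
    # Compare element-wise with a tolerance, as one would against a baseline.
    for a, e in zip(actual, expected):
        assert math.isclose(a, e, rel_tol=1e-9, abs_tol=1e-12)
    print("online softmax matches the reference on the sample row")
```

In a GPU kernel the appeal of this formulation is that the maximum and the sum no longer need separate reads of the input row, which can reduce memory traffic. Whether that pays off on a given device is something the benchmark and profiling steps above would have to confirm.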
