22 changes: 22 additions & 0 deletions benchmarks/nvidia-sdpa/Dockerfile
FROM nvcr.io/nvidia/pytorch:25.09-py3

RUN pip install --upgrade pip && \
pip install seaborn
Comment on lines +3 to +4

medium

To reduce the number of image layers, combine the two pip invocations into a single RUN instruction. Adding --no-cache-dir prevents pip from caching downloaded packages, which further reduces the final image size.

RUN pip install --no-cache-dir --upgrade pip seaborn


RUN apt-get update && \
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/sbsa/cuda-keyring_1.1-1_all.deb && \
dpkg -i cuda-keyring_1.1-1_all.deb && \
apt-get update && \
apt-get -y install cudnn9-cuda-13
Comment on lines +6 to +10

medium

Clean up temporary files and caches in the same RUN layer that creates them: remove the downloaded .deb after installing the keyring, and clear the apt lists after installation to reduce the final image size.

RUN apt-get update && \
    wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/sbsa/cuda-keyring_1.1-1_all.deb && \
    dpkg -i cuda-keyring_1.1-1_all.deb && \
    rm cuda-keyring_1.1-1_all.deb && \
    apt-get update && \
    apt-get -y install cudnn9-cuda-13 && \
    rm -rf /var/lib/apt/lists/*


RUN pip uninstall -y cudnn

COPY benchmark_bf16_sdpa.py .

COPY benchmark_fp8_sdpa.py .

COPY benchmark_single_sdpa.py .
Comment on lines +14 to +18

medium

To reduce the number of image layers and improve readability, combine the three COPY instructions into a single one using a wildcard:

COPY benchmark_*.py .


ENV LD_LIBRARY_PATH=/usr/lib/aarch64-linux-gnu/:$LD_LIBRARY_PATH

WORKDIR /workspace
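The ENV line above matters because the apt-installed aarch64 cuDNN lands under /usr/lib/aarch64-linux-gnu/; prepending that directory makes the dynamic loader pick it up before any copy bundled with the base image. A minimal shell sketch of the resulting search order (the pre-existing path value is illustrative):

```shell
# Simulate the Dockerfile's ENV line: prepend the aarch64 library dir
existing="/usr/local/cuda/lib64"   # illustrative prior LD_LIBRARY_PATH value
LD_LIBRARY_PATH="/usr/lib/aarch64-linux-gnu/:${existing}"
# The loader searches entries left to right, so the aarch64 dir is checked first
echo "${LD_LIBRARY_PATH%%:*}"
```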
13 changes: 13 additions & 0 deletions benchmarks/nvidia-sdpa/README.md
## Scaled Dot Product Attention Benchmark

The upstream NVIDIA benchmark, part of the cudnn-frontend package (https://github.com/NVIDIA/cudnn-frontend/tree/main/benchmark/sdpa_benchmark_training), uses x86_64-specific packages, which don't work on GB300 systems because Grace CPUs are Arm (aarch64).

This repository provides a fixed Dockerfile that can be used on NVIDIA Grace-based systems.

Steps:
1. Clone the repository
- `git clone https://github.com/NVIDIA/cudnn-frontend`
2. Replace the Dockerfile at `cudnn-frontend/benchmark/sdpa_benchmark_training/Dockerfile` with the one from this repo.
3. Follow the upstream instructions as normal:
   - `docker build -t cudnn_attention_benchmark .`
   - `docker run -it --gpus all --rm -v $(pwd):/workspace cudnn_attention_benchmark`
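The steps above can be sketched as a single shell session (assuming the fixed Dockerfile sits at `./Dockerfile` in this repo's checkout; paths are illustrative):

```shell
# Clone the upstream benchmark sources
git clone https://github.com/NVIDIA/cudnn-frontend

# Replace the upstream Dockerfile with the aarch64-compatible one from this repo
cp Dockerfile cudnn-frontend/benchmark/sdpa_benchmark_training/Dockerfile
cd cudnn-frontend/benchmark/sdpa_benchmark_training

# Build and run exactly as in the upstream instructions
docker build -t cudnn_attention_benchmark .
docker run -it --gpus all --rm -v "$(pwd)":/workspace cudnn_attention_benchmark
```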