
Conversation

@WillTrojak (Member)

Added register spilling to shared memory for the two kernels that don't already use shared memory. This feature requires CUDA >= 12.9.

Additionally, I added launch bounds to the CUDA kernels. This generally gives a performance boost, and it especially helps when spilling to shared memory.
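For reference, a minimal sketch of the launch-bounds annotation mentioned above (the kernel and the bound values are hypothetical, not taken from this PR): `__launch_bounds__` tells the compiler the maximum threads per block, and optionally a minimum number of resident blocks per SM, so it can budget registers accordingly.

```cuda
// Hypothetical kernel used only to illustrate __launch_bounds__; the
// arguments (128 threads per block, 2 resident blocks per SM) are
// assumptions, not the values used in this PR.
template<typename T>
__global__ void __launch_bounds__(128, 2)
axpy(int n, T a, const T* __restrict__ x, T* __restrict__ y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;

    if (i < n)
        y[i] = a*x[i] + y[i];
}
```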

const ${dtype}* __restrict__ b, int ldb,
${dtype}* __restrict__ c, int ldc)
{
#if ( ( defined(__CUDACC_VER_MAJOR__) && ( __CUDACC_VER_MAJOR__ >= 13 ) ) || \
Contributor

When would __CUDACC_VER_MAJOR__ not be defined?

@WillTrojak (Member Author), Dec 1, 2025

No, I can't think of a case where those wouldn't be defined when compiling with NVIDIA tools. But there are third-party tools, like SCALE, that claim to be able to compile CUDA for other accelerators, and I don't know what those define. So I thought it was good practice to check that the macros exist first.

Contributor

I think we can just check directly. Also do we need to care about CUDA 12? Seems easier to just require 13 or later.
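A quick illustration of the two guard styles under discussion. The direct form works because, in a standard C/C++ preprocessor, an identifier that is not defined evaluates to 0 inside an #if expression, so a toolchain that does not define the macro simply skips the guarded code.

```cuda
// Defensive form: only enable the feature if the version macro exists
// and reports CUDA 13 or later.
#if defined(__CUDACC_VER_MAJOR__) && (__CUDACC_VER_MAJOR__ >= 13)
// ... CUDA 13+ path ...
#endif

// Direct form: equivalent in practice, since an undefined
// __CUDACC_VER_MAJOR__ is treated as 0 and the test fails.
#if __CUDACC_VER_MAJOR__ >= 13
// ... CUDA 13+ path ...
#endif
```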

Member Author

OK, changed this to just CUDA 13.

const ${dtype}* __restrict__ b, int ldb,
${dtype}* __restrict__ c, int ldc)
{
#if ( __CUDACC_VER_MAJOR__ >= 13 )
Contributor

Does it have to go at the start of a function or can we move it down after the variable declarations so that it only needs to appear once?

Member Author

No, it needs to come first.
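For context, a hedged sketch of the placement constraint. The pragma spelling (enable_smem_spilling) is an assumption about the CUDA 12.9+ shared-memory spilling feature, and the kernel body is hypothetical; the point is only that the pragma must be the very first thing inside the kernel body, before any declarations, so it cannot be hoisted below them or shared between kernels.

```cuda
// Illustrative only: the pragma spelling below is an assumption about the
// CUDA 12.9+ register-spilling feature; the kernel itself is hypothetical.
__global__ void __launch_bounds__(128)
example(int n, const double* __restrict__ b, double* __restrict__ c)
{
#if __CUDACC_VER_MAJOR__ >= 13
    #pragma enable_smem_spilling   // must be the first statement in the body
#endif
    // Declarations and the rest of the kernel follow the pragma
    double acc = 0.0;

    for (int i = 0; i < n; i++)
        acc += b[i];

    if (threadIdx.x == 0 && blockIdx.x == 0)
        c[0] = acc;
}
```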

@WillTrojak (Member Author)

Here is the FP64 performance improvement in % for N = 48^3.

| p | mat  | hex          | pri          | tet          |
|---|------|--------------|--------------|--------------|
| 2 | m0   | -1.826484905 | 2.009701195  | 6.944436481  |
| 2 | m3   | -0.295154775 | 8.019251568  | -7.427670957 |
| 2 | m6   | -2.818445786 | -2.887391268 | 2.33066366   |
| 2 | m132 | -5.844456427 | -7.246381148 | 7.939180555  |
| 2 | m460 | 0.203088554  | 0            | -11.92196406 |
| 3 | m0   | 8.157557453  | -0.128480629 | 0            |
| 3 | m3   | 0.017075499  | -2.372395288 | 0.037665092  |
| 3 | m6   | -1.078167805 | 0.760461761  | -0.100782278 |
| 3 | m132 | -0.080955364 | 0.077853383  | 3.225807287  |
| 3 | m460 | 5.879638405  | 0            | 0.058428688  |
| 4 | m0   | 0.896861374  | -0.018543607 | 1.434853369  |
| 4 | m3   | -0.615201494 | -0.514135272 | 0            |
| 4 | m6   | 0.290004409  | 26.83560534  | 0.009443318  |
| 4 | m132 | -0.007251841 | 0.051970419  | 0            |
| 4 | m460 | -0.239808226 | 2.616519779  | 0.012443141  |
| 5 | m0   | 0.559627573  | 3.384434896  | 0            |
| 5 | m3   | 0.830226822  | 2.435323222  | 0            |
| 5 | m6   | -5.626592215 | 17.21821181  | 0            |
| 5 | m132 | -0.603549919 | 6.455492999  | 0            |
| 5 | m460 | 0.863485539  | 33.61152995  | 0.45769551   |
| 6 | m0   | 0.343045801  | 10.4882413   | 0.032300137  |
| 6 | m3   | 0.683529765  | 1.207283004  | 0.47698948   |
| 6 | m6   | 3.375551986  | 2.312175893  | -0.114073593 |
| 6 | m132 | 0.071276331  | 21.72202425  | 0.021733253  |
| 6 | m460 | 1.651250203  | 9.148251482  | 0.004187673  |

@FreddieWitherden (Contributor)

Do you have absolute numbers, i.e. what fraction of peak FLOPs/bandwidth we achieve?
