Added spill to shared and launch bounds #16
gimmik/kernels/cuda/bstream.mako
Outdated
const ${dtype}* __restrict__ b, int ldb,
${dtype}* __restrict__ c, int ldc)
{
#if ( ( defined(__CUDACC_VER_MAJOR__) && ( __CUDACC_VER_MAJOR__ >= 13 ) ) || \
When would `__CUDACC_VER_MAJOR__` not be defined?
I can't think of a case where those wouldn't be defined when compiling with NVIDIA's tools. But there are third-party toolchains, such as SCALE, that claim to compile CUDA for other accelerators, and I don't know what they define. So I thought it was good practice to check that the macros exist first.
I think we can just check directly. Also do we need to care about CUDA 12? Seems easier to just require 13 or later.
OK, changed this to just CUDA 13.
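A note on why the direct check is safe: inside a `#if` expression, any identifier that is not a defined macro is replaced with `0`, so a compiler that never defines `__CUDACC_VER_MAJOR__` sees a false condition rather than an error. The sketch below is plain C++ and only illustrates that preprocessor rule. The names `kDirectCheck`, `kGuardedCheck`, `direct_check_taken`, and `guarded_check_taken` are hypothetical and do not come from the PR.

```cpp
// Plain C++ sketch; compiles with a host compiler or with nvcc.
// All names below are hypothetical and exist only for this illustration.

// Direct check, as suggested in the review. If __CUDACC_VER_MAJOR__ is not
// defined, the preprocessor evaluates it as 0, so this branch is skipped
// without a compile error.
#if __CUDACC_VER_MAJOR__ >= 13
constexpr bool kDirectCheck = true;
#else
constexpr bool kDirectCheck = false;
#endif

// First clause of the earlier, more defensive guard.
#if defined(__CUDACC_VER_MAJOR__) && (__CUDACC_VER_MAJOR__ >= 13)
constexpr bool kGuardedCheck = true;
#else
constexpr bool kGuardedCheck = false;
#endif

// Both forms always select the same branch.
static_assert(kDirectCheck == kGuardedCheck,
              "direct and guarded checks must agree");

constexpr bool direct_check_taken() { return kDirectCheck; }
constexpr bool guarded_check_taken() { return kGuardedCheck; }
```

One caveat: GCC and Clang warn about the undefined identifier when `-Wundef` is enabled, so a build that combines `-Wundef` with `-Werror` would still need the `defined(...)` form.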
const ${dtype}* __restrict__ b, int ldb,
${dtype}* __restrict__ c, int ldc)
{
#if ( __CUDACC_VER_MAJOR__ >= 13 )
Does it have to go at the start of a function or can we move it down after the variable declarations so that it only needs to appear once?
No, it needs to come first.
Here is the FP64 performance improvement in % for N = 48^3.
[Results table or figure not preserved]
Do you have absolute numbers, so we can see what fraction of peak FLOPs/bandwidth we achieve?
Added spilling to shared memory for the two kernels that don't already use shared memory. This feature requires CUDA >= 12.9.
Additionally, I added launch bounds to the CUDA kernels. This generally improves performance, and helps especially when spilling to shared memory.