-
Notifications
You must be signed in to change notification settings - Fork 51
Change XCD remapping logic #2161
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR updates the XCD (chiplet) remapping logic to use the HipKittens algorithm for improved L2 cache optimization. The key change is to compute chunkSize based on groupSize squared, creating spatially local tiles that align with what a chiplet will process.
Key Changes:
- Updated
rearrangeWorkgroupsForXCCfunction to implement HipKittens-style remapping with chunkSize parameter - Modified groupSize heuristic to account for number of chiplets:
groupSize = ceil(sqrt(numCU / numChiplets)) * (bitWidthOut / bitWidthIn) - Changed chunkSize calculation to
min(groupSize * groupSize, max(1, gridSize / numChiplets))to ensure XCD remapping even for small kernels - Extended gridGroupSize tuning parameter space to include values 1 and 2
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| mlir/lib/Dialect/Rock/Transforms/GridLayoutEmitter.cpp | Implements new XCD remapping algorithm with updated groupSize heuristic and chunkSize calculation; moves XCD remapping after groupSize computation |
| mlir/lib/Dialect/Rock/Tuning/RockTuningImpl.cpp | Adds gridGroupSize tuning values 1 and 2 to the finetuning parameter space |
| mlir/test/Dialect/Rock/gridwise_gemm_accel_lowering.mlir | Updates test cases with larger matrix dimensions and new expected CHECK patterns for the revised XCD remapping logic |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // CHECK-DAG: %[[GROUD_ID:.+]] = arith.divui %[[NEW_BID2]], %c240 : index | ||
| // CHECK-DAG: %[[FIRST_BID_M:.+]] = arith.muli %[[GROUD_ID]], %c6 : index |
Copilot
AI
Dec 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spelling error: "GROUD_ID" should be "GROUP_ID" in variable name GROUD_ID.
| // CHECK-DAG: %[[GROUD_ID:.+]] = arith.divui %[[NEW_BID2]], %c240 : index | |
| // CHECK-DAG: %[[FIRST_BID_M:.+]] = arith.muli %[[GROUD_ID]], %c6 : index | |
| // CHECK-DAG: %[[GROUP_ID:.+]] = arith.divui %[[NEW_BID2]], %c240 : index | |
| // CHECK-DAG: %[[FIRST_BID_M:.+]] = arith.muli %[[GROUP_ID]], %c6 : index |
| int64_t groupSize = std::ceil(std::sqrt(info.numCU / numChiplets)) * | ||
| (bitWidthOut / bitWidthIn); |
Copilot
AI
Dec 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential precision loss in groupSize calculation due to integer division. The expression info.numCU / numChiplets performs integer division before being passed to sqrt. Consider using floating point division: std::ceil(std::sqrt(static_cast<double>(info.numCU) / numChiplets)) to avoid truncating before the square root operation.
| int64_t groupSize = std::ceil(std::sqrt(info.numCU / numChiplets)) * | |
| (bitWidthOut / bitWidthIn); | |
| int64_t groupSize = static_cast<int64_t>( | |
| std::ceil(std::sqrt(static_cast<double>(info.numCU) / numChiplets) * | |
| (static_cast<double>(bitWidthOut) / bitWidthIn))); |
| // CHECK-DAG: %[[CHUNCK_IDX:.+]] = arith.divui %[[LOCAL_BID]], %c36 : index | ||
| // CHECK-DAG: %[[POS_IN_CHUNCK:.+]] = arith.remui %[[LOCAL_BID]], %c36 : index | ||
| // CHECK-DAG: %[[CHUNCK_IDX_BY_BLOCK:.+]] = arith.muli %[[CHUNCK_IDX]], %c288 : index | ||
| // CHECK-DAG: %[[XCD_BY_CHUNKSIZE:.+]] = arith.muli %[[XCD]], %c36 : index | ||
| // CHECK-DAG: %[[MAYBE_NEW_BID_PARTIAL:.+]] = arith.addi %[[CHUNCK_IDX_BY_BLOCK]], %[[XCD_BY_CHUNKSIZE]] : index | ||
| // CHECK-DAG: %[[MAYBE_NEW_BID:.+]] = arith.addi %[[MAYBE_NEW_BID_PARTIAL]], %[[POS_IN_CHUNCK]] : index |
Copilot
AI
Dec 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spelling error: "CHUNCK" should be "CHUNK" in variable names CHUNCK_IDX, POS_IN_CHUNCK, and CHUNCK_IDX_BY_BLOCK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dhernandez0 see this and other spelling errors down below
| // CHECK-DAG: %[[CHUNCK_IDX:.+]] = arith.divui %[[LOCAL_BID]], %c25 : index | ||
| // CHECK-DAG: %[[POS_IN_CHUNCK:.+]] = arith.remui %[[LOCAL_BID]], %c25 : index | ||
| // CHECK-DAG: %[[CHUNCK_IDX_BY_BLOCK:.+]] = arith.muli %[[CHUNCK_IDX]], %c100 : index | ||
| // CHECK-DAG: %[[XCD_BY_CHUNKSIZE:.+]] = arith.muli %[[XCD]], %c25 : index | ||
| // CHECK-DAG: %[[MAYBE_NEW_BID_PARTIAL:.+]] = arith.addi %[[CHUNCK_IDX_BY_BLOCK]], %[[XCD_BY_CHUNKSIZE]] : index | ||
| // CHECK-DAG: %[[MAYBE_NEW_BID:.+]] = arith.addi %[[MAYBE_NEW_BID_PARTIAL]], %[[POS_IN_CHUNCK]] : index |
Copilot
AI
Dec 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spelling error: "CHUNCK" should be "CHUNK" in variable names CHUNCK_IDX, POS_IN_CHUNCK, and CHUNCK_IDX_BY_BLOCK.
| // CHECK-DAG: %[[CHUNCK_IDX:.+]] = arith.divui %[[LOCAL_BID]], %c81 : index | ||
| // CHECK-DAG: %[[POS_IN_CHUNCK:.+]] = arith.remui %[[LOCAL_BID]], %c81 : index | ||
| // CHECK-DAG: %[[CHUNCK_IDX_BY_BLOCK:.+]] = arith.muli %[[CHUNCK_IDX]], %c648 : index | ||
| // CHECK-DAG: %[[XCD_BY_CHUNKSIZE:.+]] = arith.muli %[[XCD]], %c81 : index | ||
| // CHECK-DAG: %[[MAYBE_NEW_BID_PARTIAL:.+]] = arith.addi %[[CHUNCK_IDX_BY_BLOCK]], %[[XCD_BY_CHUNKSIZE]] : index | ||
| // CHECK-DAG: %[[MAYBE_NEW_BID:.+]] = arith.addi %[[MAYBE_NEW_BID_PARTIAL]], %[[POS_IN_CHUNCK]] : index |
Copilot
AI
Dec 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spelling error: "CHUNCK" should be "CHUNK" in variable names CHUNCK_IDX, POS_IN_CHUNCK, and CHUNCK_IDX_BY_BLOCK.
| // CHECK-DAG: %[[IS_LARGER_THAN_FULLBLCOK:.+]] = arith.cmpi sgt, %[[BID]], %c576 : index | ||
| // CHECK-DAG: %[[NEW_BID:.+]] = arith.select %[[IS_LARGER_THAN_FULLBLCOK]], %[[BID]], %[[MAYBE_NEW_BID]] : index |
Copilot
AI
Dec 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spelling error: "FULLBLCOK" should be "FULLBLOCK" in variable name IS_LARGER_THAN_FULLBLCOK.
| // CHECK-DAG: %[[IS_LARGER_THAN_FULLBLCOK:.+]] = arith.cmpi sgt, %[[BID]], %c576 : index | ||
| // CHECK-DAG: %[[NEW_BID:.+]] = arith.select %[[IS_LARGER_THAN_FULLBLCOK]], %[[BID]], %[[MAYBE_NEW_BID]] : index |
Copilot
AI
Dec 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spelling error: "FULLBLCOK" should be "FULLBLOCK" in variable name IS_LARGER_THAN_FULLBLCOK.
| // CHECK-DAG: %[[IS_LARGER_THAN_FULLBLCOK:.+]] = arith.cmpi sgt, %[[BID]], %c800 : index | ||
| // CHECK-DAG: %[[NEW_BID:.+]] = arith.select %[[IS_LARGER_THAN_FULLBLCOK]], %[[BID]], %[[MAYBE_NEW_BID]] : index |
Copilot
AI
Dec 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spelling error: "FULLBLCOK" should be "FULLBLOCK" in variable name IS_LARGER_THAN_FULLBLCOK.
| // CHECK-DAG: %[[IS_LARGER_THAN_FULLBLCOK:.+]] = arith.cmpi sgt, %[[BID]], %c648 : index | ||
| // CHECK-DAG: %[[NEW_BID:.+]] = arith.select %[[IS_LARGER_THAN_FULLBLCOK]], %[[BID]], %[[MAYBE_NEW_BID]] : index |
Copilot
AI
Dec 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spelling error: "FULLBLCOK" should be "FULLBLOCK" in variable name IS_LARGER_THAN_FULLBLCOK.
| // CHECK-DAG: %[[CHUNCK_IDX:.+]] = arith.divui %[[LOCAL_BID]], %c36 : index | ||
| // CHECK-DAG: %[[POS_IN_CHUNCK:.+]] = arith.remui %[[LOCAL_BID]], %c36 : index | ||
| // CHECK-DAG: %[[CHUNCK_IDX_BY_BLOCK:.+]] = arith.muli %[[CHUNCK_IDX]], %c288 : index | ||
| // CHECK-DAG: %[[XCD_BY_CHUNKSIZE:.+]] = arith.muli %[[XCD]], %c36 : index | ||
| // CHECK-DAG: %[[MAYBE_NEW_BID_PARTIAL:.+]] = arith.addi %[[CHUNCK_IDX_BY_BLOCK]], %[[XCD_BY_CHUNKSIZE]] : index | ||
| // CHECK-DAG: %[[MAYBE_NEW_BID:.+]] = arith.addi %[[MAYBE_NEW_BID_PARTIAL]], %[[POS_IN_CHUNCK]] : index |
Copilot
AI
Dec 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spelling error: "CHUNCK" should be "CHUNK" in variable names CHUNCK_IDX, POS_IN_CHUNCK, and CHUNCK_IDX_BY_BLOCK.
justinrosner
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like there's good perf improvement with this, thanks Daniel!
| // CHECK-DAG: %[[CHUNCK_IDX:.+]] = arith.divui %[[LOCAL_BID]], %c36 : index | ||
| // CHECK-DAG: %[[POS_IN_CHUNCK:.+]] = arith.remui %[[LOCAL_BID]], %c36 : index | ||
| // CHECK-DAG: %[[CHUNCK_IDX_BY_BLOCK:.+]] = arith.muli %[[CHUNCK_IDX]], %c288 : index | ||
| // CHECK-DAG: %[[XCD_BY_CHUNKSIZE:.+]] = arith.muli %[[XCD]], %c36 : index | ||
| // CHECK-DAG: %[[MAYBE_NEW_BID_PARTIAL:.+]] = arith.addi %[[CHUNCK_IDX_BY_BLOCK]], %[[XCD_BY_CHUNKSIZE]] : index | ||
| // CHECK-DAG: %[[MAYBE_NEW_BID:.+]] = arith.addi %[[MAYBE_NEW_BID_PARTIAL]], %[[POS_IN_CHUNCK]] : index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dhernandez0 see this and other spelling errors down below
pabloantoniom
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. HipKittens approach is probably better than what we currently have. Some high-level comments:
- What hardware are you using for the results? MI300 or MI350? I think the XCD remapping might be quite sensitive to the specific GPU, specially because MI300 does not have the same amount of CU per XCD.
- I'm happy to merge this, but I am against of working based only on results. Just like I did not like previous approach (we ran some tests and dividing by 2 seemed to be the best - spoiler: it was not), I don't like changing the logic because some new tests suggest it's better. I think we also need some analysis on why is better. I like what we did last time when we computed a table and checked the mapping of different algorithms. I think this PR is lacking precisely that, to understand how it behaves compared to what we had. I can take care of looking at this and post the results here.
- I think this does not implement the original ticket, which was aimed at doing a deep dive on this and understanding what is the best approach for XCD remapping. HipKittens approach might be the best remapping actually, but we lack the analysis to prove it (related to my previous point).
| std::vector<std::vector<int64_t>> finetuningParams = { | ||
| {0, 1}, // outputSwizzle | ||
| wavesPerEUList, // wavesPerEU | ||
| {0, 4, 8, 16, 32, 64}}; // gridGroupSize |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder how much speedup are we getting out of improving the search space of tuning?
MI350
I have an excel file with the remapping. I tried to explain in the PR description why I think this is a better approach. I think I can prepare a presentation to explain it. |
35de766 to
e7acb82
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| int block = numChiplets * chunkSize; | ||
| int limit = (gridSize / block) * block; |
Copilot
AI
Dec 15, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The local variables block and limit are declared as int, but they are derived from int64_t parameters (gridSize, numChiplets, chunkSize). This could lead to integer overflow if the values are large (e.g., gridSize > INT_MAX / numChiplets). Consider changing these to int64_t for consistency and to prevent potential overflow issues.
| int block = numChiplets * chunkSize; | |
| int limit = (gridSize / block) * block; | |
| int64_t block = numChiplets * chunkSize; | |
| int64_t limit = (gridSize / block) * block; |
e73c1c2 to
dff5e38
Compare
|
We can't merge this PR yet because it fails on gfx950: Changing it to |
Motivation
Try HipKittens XCD remapping.
Technical Details
HipKittens XCD remapping creates a chunkSize that is related to groupSize (for https://triton-lang.org/main/getting-started/tutorials/03-matrix-multiplication.html#l2-cache-optimizations L2 optimization). They set
chunkSize=groupSize*groupSize, which to me makes sense, as that's what an XCD is going to process.I've also updated the groupSize heuristic from
groupSize = std::ceil(std::sqrt(info.numCU / numChiplets)) * (bitWidthOut / bitWidthIn)togroupSize = std::ceil(std::sqrt(info.numCU / numChiplets)) * (bitWidthOut / bitWidthIn)to take into account the number of chiplets. This (the original formula and the new one) assumes we have a single workgroup per CU. Now, we make sure the group only stays inside one chiplet only.I've also changed how they define chunkSize, they do:
Here we are doing:
Note that what they do wouldn't do any remapping for small kernels, the change makes sure we always do XCD remapping.
Perf data on MI350
Comparing this branch vs greedy_third_iteration branch (it's based on it). We get improvement (>1.05x) on 26 GEMMs of the tier1 list. It gets worser perf for 8 GEMMs, but all of them are >0.91x. Max speed up is 1.16x.
Note I've set the same groupSize tuning space for greedy_third_iteration as well.
I'm getting attention results as well.
Partial results for attention, out of the first 47 configs, it improves one of them 1.15x, the rest are roughly the same run-time.
Test Plan
All tests pass.
Test Result
All tests pass.
Submission Checklist