Skip to content

Conversation

@dhernandez0
Copy link
Contributor

@dhernandez0 dhernandez0 commented Dec 9, 2025

Motivation

Try HipKittens XCD remapping.

Technical Details

HipKittens XCD remapping creates a chunkSize that is related to groupSize (for https://triton-lang.org/main/getting-started/tutorials/03-matrix-multiplication.html#l2-cache-optimizations L2 optimization). They set chunkSize=groupSize*groupSize, which to me makes sense, as that's what an XCD is going to process.

I've also updated the groupSize heuristic from groupSize = std::ceil(std::sqrt(info.numCU / numChiplets)) * (bitWidthOut / bitWidthIn) to groupSize = std::ceil(std::sqrt(info.numCU / numChiplets)) * (bitWidthOut / bitWidthIn) to take into account the number of chiplets. This (the original formula and the new one) assumes we have a single workgroup per CU. Now, we make sure the group only stays inside one chiplet only.

I've also changed how they define chunkSize, they do:

int64_t chunkSize = groupSize * groupSize;

Here we are doing:

int64_t chunkSize = std::min(groupSize * groupSize,
                                 std::max(int64_t{1}, gridSize / numChiplets));

Note that what they do wouldn't do any remapping for small kernels, the change makes sure we always do XCD remapping.

Perf data on MI350

Comparing this branch vs greedy_third_iteration branch (it's based on it). We get improvement (>1.05x) on 26 GEMMs of the tier1 list. It gets worser perf for 8 GEMMs, but all of them are >0.91x. Max speed up is 1.16x.

Note I've set the same groupSize tuning space for greedy_third_iteration as well.

I'm getting attention results as well.
Partial results for attention, out of the first 47 configs, it improves one of them 1.15x, the rest are roughly the same run-time.

Test Plan

All tests pass.

Test Result

All tests pass.

Submission Checklist

@dhernandez0 dhernandez0 self-assigned this Dec 9, 2025
@dhernandez0 dhernandez0 requested a review from causten as a code owner December 9, 2025 10:39
@dhernandez0 dhernandez0 changed the title Change XCD remapping logic [DRAFT] Change XCD remapping logic Dec 9, 2025
@dhernandez0 dhernandez0 changed the title [DRAFT] Change XCD remapping logic Change XCD remapping logic Dec 10, 2025
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the XCD (chiplet) remapping logic to use the HipKittens algorithm for improved L2 cache optimization. The key change is to compute chunkSize based on groupSize squared, creating spatially local tiles that align with what a chiplet will process.

Key Changes:

  • Updated rearrangeWorkgroupsForXCC function to implement HipKittens-style remapping with chunkSize parameter
  • Modified groupSize heuristic to account for number of chiplets: groupSize = ceil(sqrt(numCU / numChiplets)) * (bitWidthOut / bitWidthIn)
  • Changed chunkSize calculation to min(groupSize * groupSize, max(1, gridSize / numChiplets)) to ensure XCD remapping even for small kernels
  • Extended gridGroupSize tuning parameter space to include values 1 and 2

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 11 comments.

File Description
mlir/lib/Dialect/Rock/Transforms/GridLayoutEmitter.cpp Implements new XCD remapping algorithm with updated groupSize heuristic and chunkSize calculation; moves XCD remapping after groupSize computation
mlir/lib/Dialect/Rock/Tuning/RockTuningImpl.cpp Adds gridGroupSize tuning values 1 and 2 to the finetuning parameter space
mlir/test/Dialect/Rock/gridwise_gemm_accel_lowering.mlir Updates test cases with larger matrix dimensions and new expected CHECK patterns for the revised XCD remapping logic

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 337 to 338
// CHECK-DAG: %[[GROUD_ID:.+]] = arith.divui %[[NEW_BID2]], %c240 : index
// CHECK-DAG: %[[FIRST_BID_M:.+]] = arith.muli %[[GROUD_ID]], %c6 : index
Copy link

Copilot AI Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spelling error: "GROUD_ID" should be "GROUP_ID" in variable name GROUD_ID.

Suggested change
// CHECK-DAG: %[[GROUD_ID:.+]] = arith.divui %[[NEW_BID2]], %c240 : index
// CHECK-DAG: %[[FIRST_BID_M:.+]] = arith.muli %[[GROUD_ID]], %c6 : index
// CHECK-DAG: %[[GROUP_ID:.+]] = arith.divui %[[NEW_BID2]], %c240 : index
// CHECK-DAG: %[[FIRST_BID_M:.+]] = arith.muli %[[GROUP_ID]], %c6 : index

Copilot uses AI. Check for mistakes.
Comment on lines +97 to +98
int64_t groupSize = std::ceil(std::sqrt(info.numCU / numChiplets)) *
(bitWidthOut / bitWidthIn);
Copy link

Copilot AI Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential precision loss in groupSize calculation due to integer division. The expression info.numCU / numChiplets performs integer division before being passed to sqrt. Consider using floating point division: std::ceil(std::sqrt(static_cast<double>(info.numCU) / numChiplets)) to avoid truncating before the square root operation.

Suggested change
int64_t groupSize = std::ceil(std::sqrt(info.numCU / numChiplets)) *
(bitWidthOut / bitWidthIn);
int64_t groupSize = static_cast<int64_t>(
std::ceil(std::sqrt(static_cast<double>(info.numCU) / numChiplets) *
(static_cast<double>(bitWidthOut) / bitWidthIn)));

Copilot uses AI. Check for mistakes.
Comment on lines 91 to 96
// CHECK-DAG: %[[CHUNCK_IDX:.+]] = arith.divui %[[LOCAL_BID]], %c36 : index
// CHECK-DAG: %[[POS_IN_CHUNCK:.+]] = arith.remui %[[LOCAL_BID]], %c36 : index
// CHECK-DAG: %[[CHUNCK_IDX_BY_BLOCK:.+]] = arith.muli %[[CHUNCK_IDX]], %c288 : index
// CHECK-DAG: %[[XCD_BY_CHUNKSIZE:.+]] = arith.muli %[[XCD]], %c36 : index
// CHECK-DAG: %[[MAYBE_NEW_BID_PARTIAL:.+]] = arith.addi %[[CHUNCK_IDX_BY_BLOCK]], %[[XCD_BY_CHUNKSIZE]] : index
// CHECK-DAG: %[[MAYBE_NEW_BID:.+]] = arith.addi %[[MAYBE_NEW_BID_PARTIAL]], %[[POS_IN_CHUNCK]] : index
Copy link

Copilot AI Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spelling error: "CHUNCK" should be "CHUNK" in variable names CHUNCK_IDX, POS_IN_CHUNCK, and CHUNCK_IDX_BY_BLOCK.

Copilot uses AI. Check for mistakes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dhernandez0 see this and other spelling errors down below

Comment on lines 111 to 116
// CHECK-DAG: %[[CHUNCK_IDX:.+]] = arith.divui %[[LOCAL_BID]], %c25 : index
// CHECK-DAG: %[[POS_IN_CHUNCK:.+]] = arith.remui %[[LOCAL_BID]], %c25 : index
// CHECK-DAG: %[[CHUNCK_IDX_BY_BLOCK:.+]] = arith.muli %[[CHUNCK_IDX]], %c100 : index
// CHECK-DAG: %[[XCD_BY_CHUNKSIZE:.+]] = arith.muli %[[XCD]], %c25 : index
// CHECK-DAG: %[[MAYBE_NEW_BID_PARTIAL:.+]] = arith.addi %[[CHUNCK_IDX_BY_BLOCK]], %[[XCD_BY_CHUNKSIZE]] : index
// CHECK-DAG: %[[MAYBE_NEW_BID:.+]] = arith.addi %[[MAYBE_NEW_BID_PARTIAL]], %[[POS_IN_CHUNCK]] : index
Copy link

Copilot AI Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spelling error: "CHUNCK" should be "CHUNK" in variable names CHUNCK_IDX, POS_IN_CHUNCK, and CHUNCK_IDX_BY_BLOCK.

Copilot uses AI. Check for mistakes.
Comment on lines 305 to 310
// CHECK-DAG: %[[CHUNCK_IDX:.+]] = arith.divui %[[LOCAL_BID]], %c81 : index
// CHECK-DAG: %[[POS_IN_CHUNCK:.+]] = arith.remui %[[LOCAL_BID]], %c81 : index
// CHECK-DAG: %[[CHUNCK_IDX_BY_BLOCK:.+]] = arith.muli %[[CHUNCK_IDX]], %c648 : index
// CHECK-DAG: %[[XCD_BY_CHUNKSIZE:.+]] = arith.muli %[[XCD]], %c81 : index
// CHECK-DAG: %[[MAYBE_NEW_BID_PARTIAL:.+]] = arith.addi %[[CHUNCK_IDX_BY_BLOCK]], %[[XCD_BY_CHUNKSIZE]] : index
// CHECK-DAG: %[[MAYBE_NEW_BID:.+]] = arith.addi %[[MAYBE_NEW_BID_PARTIAL]], %[[POS_IN_CHUNCK]] : index
Copy link

Copilot AI Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spelling error: "CHUNCK" should be "CHUNK" in variable names CHUNCK_IDX, POS_IN_CHUNCK, and CHUNCK_IDX_BY_BLOCK.

Copilot uses AI. Check for mistakes.
Comment on lines 334 to 335
// CHECK-DAG: %[[IS_LARGER_THAN_FULLBLCOK:.+]] = arith.cmpi sgt, %[[BID]], %c576 : index
// CHECK-DAG: %[[NEW_BID:.+]] = arith.select %[[IS_LARGER_THAN_FULLBLCOK]], %[[BID]], %[[MAYBE_NEW_BID]] : index
Copy link

Copilot AI Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spelling error: "FULLBLCOK" should be "FULLBLOCK" in variable name IS_LARGER_THAN_FULLBLCOK.

Copilot uses AI. Check for mistakes.
Comment on lines 97 to 98
// CHECK-DAG: %[[IS_LARGER_THAN_FULLBLCOK:.+]] = arith.cmpi sgt, %[[BID]], %c576 : index
// CHECK-DAG: %[[NEW_BID:.+]] = arith.select %[[IS_LARGER_THAN_FULLBLCOK]], %[[BID]], %[[MAYBE_NEW_BID]] : index
Copy link

Copilot AI Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spelling error: "FULLBLCOK" should be "FULLBLOCK" in variable name IS_LARGER_THAN_FULLBLCOK.

Copilot uses AI. Check for mistakes.
Comment on lines 117 to 118
// CHECK-DAG: %[[IS_LARGER_THAN_FULLBLCOK:.+]] = arith.cmpi sgt, %[[BID]], %c800 : index
// CHECK-DAG: %[[NEW_BID:.+]] = arith.select %[[IS_LARGER_THAN_FULLBLCOK]], %[[BID]], %[[MAYBE_NEW_BID]] : index
Copy link

Copilot AI Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spelling error: "FULLBLCOK" should be "FULLBLOCK" in variable name IS_LARGER_THAN_FULLBLCOK.

Copilot uses AI. Check for mistakes.
Comment on lines 311 to 312
// CHECK-DAG: %[[IS_LARGER_THAN_FULLBLCOK:.+]] = arith.cmpi sgt, %[[BID]], %c648 : index
// CHECK-DAG: %[[NEW_BID:.+]] = arith.select %[[IS_LARGER_THAN_FULLBLCOK]], %[[BID]], %[[MAYBE_NEW_BID]] : index
Copy link

Copilot AI Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spelling error: "FULLBLCOK" should be "FULLBLOCK" in variable name IS_LARGER_THAN_FULLBLCOK.

Copilot uses AI. Check for mistakes.
Comment on lines 328 to 333
// CHECK-DAG: %[[CHUNCK_IDX:.+]] = arith.divui %[[LOCAL_BID]], %c36 : index
// CHECK-DAG: %[[POS_IN_CHUNCK:.+]] = arith.remui %[[LOCAL_BID]], %c36 : index
// CHECK-DAG: %[[CHUNCK_IDX_BY_BLOCK:.+]] = arith.muli %[[CHUNCK_IDX]], %c288 : index
// CHECK-DAG: %[[XCD_BY_CHUNKSIZE:.+]] = arith.muli %[[XCD]], %c36 : index
// CHECK-DAG: %[[MAYBE_NEW_BID_PARTIAL:.+]] = arith.addi %[[CHUNCK_IDX_BY_BLOCK]], %[[XCD_BY_CHUNKSIZE]] : index
// CHECK-DAG: %[[MAYBE_NEW_BID:.+]] = arith.addi %[[MAYBE_NEW_BID_PARTIAL]], %[[POS_IN_CHUNCK]] : index
Copy link

Copilot AI Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spelling error: "CHUNCK" should be "CHUNK" in variable names CHUNCK_IDX, POS_IN_CHUNCK, and CHUNCK_IDX_BY_BLOCK.

Copilot uses AI. Check for mistakes.
Copy link
Contributor

@justinrosner justinrosner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there's good perf improvement with this, thanks Daniel!

Comment on lines 91 to 96
// CHECK-DAG: %[[CHUNCK_IDX:.+]] = arith.divui %[[LOCAL_BID]], %c36 : index
// CHECK-DAG: %[[POS_IN_CHUNCK:.+]] = arith.remui %[[LOCAL_BID]], %c36 : index
// CHECK-DAG: %[[CHUNCK_IDX_BY_BLOCK:.+]] = arith.muli %[[CHUNCK_IDX]], %c288 : index
// CHECK-DAG: %[[XCD_BY_CHUNKSIZE:.+]] = arith.muli %[[XCD]], %c36 : index
// CHECK-DAG: %[[MAYBE_NEW_BID_PARTIAL:.+]] = arith.addi %[[CHUNCK_IDX_BY_BLOCK]], %[[XCD_BY_CHUNKSIZE]] : index
// CHECK-DAG: %[[MAYBE_NEW_BID:.+]] = arith.addi %[[MAYBE_NEW_BID_PARTIAL]], %[[POS_IN_CHUNCK]] : index
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dhernandez0 see this and other spelling errors down below

Copy link
Contributor

@pabloantoniom pabloantoniom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. HipKittens approach is probably better than what we currently have. Some high-level comments:

  • What hardware are you using for the results? MI300 or MI350? I think the XCD remapping might be quite sensitive to the specific GPU, specially because MI300 does not have the same amount of CU per XCD.
  • I'm happy to merge this, but I am against of working based only on results. Just like I did not like previous approach (we ran some tests and dividing by 2 seemed to be the best - spoiler: it was not), I don't like changing the logic because some new tests suggest it's better. I think we also need some analysis on why is better. I like what we did last time when we computed a table and checked the mapping of different algorithms. I think this PR is lacking precisely that, to understand how it behaves compared to what we had. I can take care of looking at this and post the results here.
  • I think this does not implement the original ticket, which was aimed at doing a deep dive on this and understanding what is the best approach for XCD remapping. HipKittens approach might be the best remapping actually, but we lack the analysis to prove it (related to my previous point).

std::vector<std::vector<int64_t>> finetuningParams = {
{0, 1}, // outputSwizzle
wavesPerEUList, // wavesPerEU
{0, 4, 8, 16, 32, 64}}; // gridGroupSize
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder how much speedup are we getting out of improving the search space of tuning?

@dhernandez0
Copy link
Contributor Author

  • What hardware are you using for the results? MI300 or MI350? I think the XCD remapping might be quite sensitive to the specific GPU, specially because MI300 does not have the same amount of CU per XCD.

MI350

I'm happy to merge this, but I am against of working based only on results. Just like I did not like previous approach (we ran some tests and dividing by 2 seemed to be the best - spoiler: it was not), I don't like changing the logic because some new tests suggest it's better. I think we also need some analysis on why is better. I like what we did last time when we computed a table and checked the mapping of different algorithms. I think this PR is lacking precisely that, to understand how it behaves compared to what we had. I can take care of looking at this and post the results here.
I think this does not implement the original ticket, which was aimed at doing a deep dive on this and understanding what is the best approach for XCD remapping. HipKittens approach might be the best remapping actually, but we lack the analysis to prove it (related to my previous point).

I have an excel file with the remapping. I tried to explain in the PR description why I think this is a better approach. I think I can prepare a presentation to explain it.

Base automatically changed from greedy_third_iteration to develop December 12, 2025 14:05
@dhernandez0 dhernandez0 requested a review from Copilot December 15, 2025 09:55
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 50 to 51
int block = numChiplets * chunkSize;
int limit = (gridSize / block) * block;
Copy link

Copilot AI Dec 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The local variables block and limit are declared as int, but they are derived from int64_t parameters (gridSize, numChiplets, chunkSize). This could lead to integer overflow if the values are large (e.g., gridSize > INT_MAX / numChiplets). Consider changing these to int64_t for consistency and to prevent potential overflow issues.

Suggested change
int block = numChiplets * chunkSize;
int limit = (gridSize / block) * block;
int64_t block = numChiplets * chunkSize;
int64_t limit = (gridSize / block) * block;

Copilot uses AI. Check for mistakes.
@dhernandez0
Copy link
Contributor Author

We can't merge this PR yet because it fails on gfx950:

bin/rocmlir-gen -g 1 -m 32768 -k 32 -n 32769 --transA=false --transB=false --transC=true -t f4E2M1FN -scale_a_dtype f8E8M0FNU -scale_b_dtype f8E8M0FNU --schedule_version  2  --operation gemm -out_dtype f32 --scaledGemm --arch gfx950:sramecc+:xnack- -pv_with_gpu   | bin/rocmlir-driver -c -gO 3 | external/llvm-project/llvm/bin/mlir-runner -O2 --shared-libs=external/llvm-project/llvm/lib/libmlir_rocm_runtime.so,lib/libconv-validation-wrappers.so,external/llvm-project/llvm/lib/libmlir_runner_utils.so,external/llvm-project/llvm/lib/libmlir_float16_utils.so --entry-point-result=void

Changing it to -gO 0 makes the command pass.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants