Change XCD remapping logic #2161

dhernandez0 · 2025-12-09T10:39:03Z

Motivation

Try HipKittens XCD remapping.

Technical Details

HipKittens XCD remapping creates a chunkSize that is related to groupSize (for https://triton-lang.org/main/getting-started/tutorials/03-matrix-multiplication.html#l2-cache-optimizations L2 optimization). They set chunkSize=groupSize*groupSize, which to me makes sense, as that's what an XCD is going to process.

I've also updated the groupSize heuristic from groupSize = std::ceil(std::sqrt(info.numCU / numChiplets)) * (bitWidthOut / bitWidthIn) to groupSize = std::ceil(std::sqrt(info.numCU / numChiplets)) * (bitWidthOut / bitWidthIn) to take into account the number of chiplets. This (the original formula and the new one) assumes we have a single workgroup per CU. Now, we make sure the group only stays inside one chiplet only.

I've also changed how they define chunkSize, they do:

int64_t chunkSize = groupSize * groupSize;

Here we are doing:

int64_t chunkSize = std::min(groupSize * groupSize,
                                 std::max(int64_t{1}, gridSize / numChiplets));

Note that what they do wouldn't do any remapping for small kernels, the change makes sure we always do XCD remapping.

Perf data on MI350

Comparing this branch vs greedy_third_iteration branch (it's based on it). We get improvement (>1.05x) on 26 GEMMs of the tier1 list. It gets worser perf for 8 GEMMs, but all of them are >0.91x. Max speed up is 1.16x.

Note I've set the same groupSize tuning space for greedy_third_iteration as well.

I'm getting attention results as well.
Partial results for attention, out of the first 47 configs, it improves one of them 1.15x, the rest are roughly the same run-time.

Test Plan

All tests pass.

Test Result

All tests pass.

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Copilot

Pull request overview

This PR updates the XCD (chiplet) remapping logic to use the HipKittens algorithm for improved L2 cache optimization. The key change is to compute chunkSize based on groupSize squared, creating spatially local tiles that align with what a chiplet will process.

Key Changes:

Updated rearrangeWorkgroupsForXCC function to implement HipKittens-style remapping with chunkSize parameter
Modified groupSize heuristic to account for number of chiplets: groupSize = ceil(sqrt(numCU / numChiplets)) * (bitWidthOut / bitWidthIn)
Changed chunkSize calculation to min(groupSize * groupSize, max(1, gridSize / numChiplets)) to ensure XCD remapping even for small kernels
Extended gridGroupSize tuning parameter space to include values 1 and 2

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 11 comments.

File	Description
mlir/lib/Dialect/Rock/Transforms/GridLayoutEmitter.cpp	Implements new XCD remapping algorithm with updated groupSize heuristic and chunkSize calculation; moves XCD remapping after groupSize computation
mlir/lib/Dialect/Rock/Tuning/RockTuningImpl.cpp	Adds gridGroupSize tuning values 1 and 2 to the finetuning parameter space
mlir/test/Dialect/Rock/gridwise_gemm_accel_lowering.mlir	Updates test cases with larger matrix dimensions and new expected CHECK patterns for the revised XCD remapping logic

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2025-12-10T16:00:25Z

mlir/test/Dialect/Rock/gridwise_gemm_accel_lowering.mlir

+  // CHECK-DAG: %[[GROUD_ID:.+]] = arith.divui %[[NEW_BID2]], %c240 : index
+  // CHECK-DAG: %[[FIRST_BID_M:.+]] = arith.muli %[[GROUD_ID]], %c6 : index


Spelling error: "GROUD_ID" should be "GROUP_ID" in variable name GROUD_ID.

Suggested change

// CHECK-DAG: %[[GROUD_ID:.+]] = arith.divui %[[NEW_BID2]], %c240 : index

// CHECK-DAG: %[[FIRST_BID_M:.+]] = arith.muli %[[GROUD_ID]], %c6 : index

// CHECK-DAG: %[[GROUP_ID:.+]] = arith.divui %[[NEW_BID2]], %c240 : index

// CHECK-DAG: %[[FIRST_BID_M:.+]] = arith.muli %[[GROUP_ID]], %c6 : index

Copilot · 2025-12-10T16:00:26Z

mlir/lib/Dialect/Rock/Transforms/GridLayoutEmitter.cpp

+  int64_t groupSize = std::ceil(std::sqrt(info.numCU / numChiplets)) *
+                      (bitWidthOut / bitWidthIn);


Potential precision loss in groupSize calculation due to integer division. The expression info.numCU / numChiplets performs integer division before being passed to sqrt. Consider using floating point division: std::ceil(std::sqrt(static_cast<double>(info.numCU) / numChiplets)) to avoid truncating before the square root operation.

Suggested change

int64_t groupSize = std::ceil(std::sqrt(info.numCU / numChiplets)) *

(bitWidthOut / bitWidthIn);

int64_t groupSize = static_cast<int64_t>(

std::ceil(std::sqrt(static_cast<double>(info.numCU) / numChiplets) *

(static_cast<double>(bitWidthOut) / bitWidthIn)));

Copilot · 2025-12-10T16:00:26Z

mlir/test/Dialect/Rock/gridwise_gemm_accel_lowering.mlir

+  // CHECK-DAG: %[[CHUNCK_IDX:.+]] = arith.divui %[[LOCAL_BID]], %c36 : index
+  // CHECK-DAG: %[[POS_IN_CHUNCK:.+]] = arith.remui %[[LOCAL_BID]], %c36 : index
+  // CHECK-DAG: %[[CHUNCK_IDX_BY_BLOCK:.+]] = arith.muli %[[CHUNCK_IDX]], %c288 : index
+  // CHECK-DAG: %[[XCD_BY_CHUNKSIZE:.+]] = arith.muli %[[XCD]], %c36 : index
+  // CHECK-DAG: %[[MAYBE_NEW_BID_PARTIAL:.+]] = arith.addi %[[CHUNCK_IDX_BY_BLOCK]], %[[XCD_BY_CHUNKSIZE]] : index
+  // CHECK-DAG: %[[MAYBE_NEW_BID:.+]] = arith.addi %[[MAYBE_NEW_BID_PARTIAL]], %[[POS_IN_CHUNCK]] : index


Spelling error: "CHUNCK" should be "CHUNK" in variable names CHUNCK_IDX, POS_IN_CHUNCK, and CHUNCK_IDX_BY_BLOCK.

@dhernandez0 see this and other spelling errors down below

Copilot · 2025-12-10T16:00:27Z

mlir/test/Dialect/Rock/gridwise_gemm_accel_lowering.mlir

+  // CHECK-DAG: %[[CHUNCK_IDX:.+]] = arith.divui %[[LOCAL_BID]], %c25 : index
+  // CHECK-DAG: %[[POS_IN_CHUNCK:.+]] = arith.remui %[[LOCAL_BID]], %c25 : index
+  // CHECK-DAG: %[[CHUNCK_IDX_BY_BLOCK:.+]] = arith.muli %[[CHUNCK_IDX]], %c100 : index
+  // CHECK-DAG: %[[XCD_BY_CHUNKSIZE:.+]] = arith.muli %[[XCD]], %c25 : index
+  // CHECK-DAG: %[[MAYBE_NEW_BID_PARTIAL:.+]] = arith.addi %[[CHUNCK_IDX_BY_BLOCK]], %[[XCD_BY_CHUNKSIZE]] : index
+  // CHECK-DAG: %[[MAYBE_NEW_BID:.+]] = arith.addi %[[MAYBE_NEW_BID_PARTIAL]], %[[POS_IN_CHUNCK]] : index


Spelling error: "CHUNCK" should be "CHUNK" in variable names CHUNCK_IDX, POS_IN_CHUNCK, and CHUNCK_IDX_BY_BLOCK.

Copilot · 2025-12-10T16:00:27Z

mlir/test/Dialect/Rock/gridwise_gemm_accel_lowering.mlir

+  // CHECK-DAG: %[[CHUNCK_IDX:.+]] = arith.divui %[[LOCAL_BID]], %c81 : index
+  // CHECK-DAG: %[[POS_IN_CHUNCK:.+]] = arith.remui %[[LOCAL_BID]], %c81 : index
+  // CHECK-DAG: %[[CHUNCK_IDX_BY_BLOCK:.+]] = arith.muli %[[CHUNCK_IDX]], %c648 : index
+  // CHECK-DAG: %[[XCD_BY_CHUNKSIZE:.+]] = arith.muli %[[XCD]], %c81 : index
+  // CHECK-DAG: %[[MAYBE_NEW_BID_PARTIAL:.+]] = arith.addi %[[CHUNCK_IDX_BY_BLOCK]], %[[XCD_BY_CHUNKSIZE]] : index
+  // CHECK-DAG: %[[MAYBE_NEW_BID:.+]] = arith.addi %[[MAYBE_NEW_BID_PARTIAL]], %[[POS_IN_CHUNCK]] : index


Spelling error: "CHUNCK" should be "CHUNK" in variable names CHUNCK_IDX, POS_IN_CHUNCK, and CHUNCK_IDX_BY_BLOCK.

Copilot · 2025-12-10T16:00:27Z

mlir/test/Dialect/Rock/gridwise_gemm_accel_lowering.mlir

+  // CHECK-DAG: %[[IS_LARGER_THAN_FULLBLCOK:.+]] = arith.cmpi sgt, %[[BID]], %c576 : index
+  // CHECK-DAG: %[[NEW_BID:.+]] = arith.select %[[IS_LARGER_THAN_FULLBLCOK]], %[[BID]], %[[MAYBE_NEW_BID]] : index


Spelling error: "FULLBLCOK" should be "FULLBLOCK" in variable name IS_LARGER_THAN_FULLBLCOK.

Copilot · 2025-12-10T16:00:28Z

mlir/test/Dialect/Rock/gridwise_gemm_accel_lowering.mlir

+  // CHECK-DAG: %[[IS_LARGER_THAN_FULLBLCOK:.+]] = arith.cmpi sgt, %[[BID]], %c576 : index
+  // CHECK-DAG: %[[NEW_BID:.+]] = arith.select %[[IS_LARGER_THAN_FULLBLCOK]], %[[BID]], %[[MAYBE_NEW_BID]] : index


Spelling error: "FULLBLCOK" should be "FULLBLOCK" in variable name IS_LARGER_THAN_FULLBLCOK.

Copilot · 2025-12-10T16:00:28Z

mlir/test/Dialect/Rock/gridwise_gemm_accel_lowering.mlir

+  // CHECK-DAG: %[[IS_LARGER_THAN_FULLBLCOK:.+]] = arith.cmpi sgt, %[[BID]], %c800 : index
+  // CHECK-DAG: %[[NEW_BID:.+]] = arith.select %[[IS_LARGER_THAN_FULLBLCOK]], %[[BID]], %[[MAYBE_NEW_BID]] : index


Spelling error: "FULLBLCOK" should be "FULLBLOCK" in variable name IS_LARGER_THAN_FULLBLCOK.

Copilot · 2025-12-10T16:00:28Z

mlir/test/Dialect/Rock/gridwise_gemm_accel_lowering.mlir

+  // CHECK-DAG: %[[IS_LARGER_THAN_FULLBLCOK:.+]] = arith.cmpi sgt, %[[BID]], %c648 : index
+  // CHECK-DAG: %[[NEW_BID:.+]] = arith.select %[[IS_LARGER_THAN_FULLBLCOK]], %[[BID]], %[[MAYBE_NEW_BID]] : index


Spelling error: "FULLBLCOK" should be "FULLBLOCK" in variable name IS_LARGER_THAN_FULLBLCOK.

Copilot · 2025-12-10T16:00:28Z

mlir/test/Dialect/Rock/gridwise_gemm_accel_lowering.mlir

+  // CHECK-DAG: %[[CHUNCK_IDX:.+]] = arith.divui %[[LOCAL_BID]], %c36 : index
+  // CHECK-DAG: %[[POS_IN_CHUNCK:.+]] = arith.remui %[[LOCAL_BID]], %c36 : index
+  // CHECK-DAG: %[[CHUNCK_IDX_BY_BLOCK:.+]] = arith.muli %[[CHUNCK_IDX]], %c288 : index
+  // CHECK-DAG: %[[XCD_BY_CHUNKSIZE:.+]] = arith.muli %[[XCD]], %c36 : index
+  // CHECK-DAG: %[[MAYBE_NEW_BID_PARTIAL:.+]] = arith.addi %[[CHUNCK_IDX_BY_BLOCK]], %[[XCD_BY_CHUNKSIZE]] : index
+  // CHECK-DAG: %[[MAYBE_NEW_BID:.+]] = arith.addi %[[MAYBE_NEW_BID_PARTIAL]], %[[POS_IN_CHUNCK]] : index


Spelling error: "CHUNCK" should be "CHUNK" in variable names CHUNCK_IDX, POS_IN_CHUNCK, and CHUNCK_IDX_BY_BLOCK.

justinrosner

Looks like there's good perf improvement with this, thanks Daniel!

justinrosner · 2025-12-10T16:58:51Z

mlir/test/Dialect/Rock/gridwise_gemm_accel_lowering.mlir

+  // CHECK-DAG: %[[CHUNCK_IDX:.+]] = arith.divui %[[LOCAL_BID]], %c36 : index
+  // CHECK-DAG: %[[POS_IN_CHUNCK:.+]] = arith.remui %[[LOCAL_BID]], %c36 : index
+  // CHECK-DAG: %[[CHUNCK_IDX_BY_BLOCK:.+]] = arith.muli %[[CHUNCK_IDX]], %c288 : index
+  // CHECK-DAG: %[[XCD_BY_CHUNKSIZE:.+]] = arith.muli %[[XCD]], %c36 : index
+  // CHECK-DAG: %[[MAYBE_NEW_BID_PARTIAL:.+]] = arith.addi %[[CHUNCK_IDX_BY_BLOCK]], %[[XCD_BY_CHUNKSIZE]] : index
+  // CHECK-DAG: %[[MAYBE_NEW_BID:.+]] = arith.addi %[[MAYBE_NEW_BID_PARTIAL]], %[[POS_IN_CHUNCK]] : index


@dhernandez0 see this and other spelling errors down below

pabloantoniom

Looks good to me. HipKittens approach is probably better than what we currently have. Some high-level comments:

What hardware are you using for the results? MI300 or MI350? I think the XCD remapping might be quite sensitive to the specific GPU, specially because MI300 does not have the same amount of CU per XCD.
I'm happy to merge this, but I am against of working based only on results. Just like I did not like previous approach (we ran some tests and dividing by 2 seemed to be the best - spoiler: it was not), I don't like changing the logic because some new tests suggest it's better. I think we also need some analysis on why is better. I like what we did last time when we computed a table and checked the mapping of different algorithms. I think this PR is lacking precisely that, to understand how it behaves compared to what we had. I can take care of looking at this and post the results here.
I think this does not implement the original ticket, which was aimed at doing a deep dive on this and understanding what is the best approach for XCD remapping. HipKittens approach might be the best remapping actually, but we lack the analysis to prove it (related to my previous point).

pabloantoniom · 2025-12-11T09:21:42Z

mlir/lib/Dialect/Rock/Tuning/RockTuningImpl.cpp

  std::vector<std::vector<int64_t>> finetuningParams = {
-      {0, 1},                 // outputSwizzle
-      wavesPerEUList,         // wavesPerEU
-      {0, 4, 8, 16, 32, 64}}; // gridGroupSize


I wonder how much speedup are we getting out of improving the search space of tuning?

dhernandez0 · 2025-12-11T09:31:29Z

What hardware are you using for the results? MI300 or MI350? I think the XCD remapping might be quite sensitive to the specific GPU, specially because MI300 does not have the same amount of CU per XCD.

MI350

I'm happy to merge this, but I am against of working based only on results. Just like I did not like previous approach (we ran some tests and dividing by 2 seemed to be the best - spoiler: it was not), I don't like changing the logic because some new tests suggest it's better. I think we also need some analysis on why is better. I like what we did last time when we computed a table and checked the mapping of different algorithms. I think this PR is lacking precisely that, to understand how it behaves compared to what we had. I can take care of looking at this and post the results here.
I think this does not implement the original ticket, which was aimed at doing a deep dive on this and understanding what is the best approach for XCD remapping. HipKittens approach might be the best remapping actually, but we lack the analysis to prove it (related to my previous point).

I have an excel file with the remapping. I tried to explain in the PR description why I think this is a better approach. I think I can prepare a presentation to explain it.

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2025-12-15T10:02:21Z

mlir/lib/Dialect/Rock/Transforms/GridLayoutEmitter.cpp

+  int block = numChiplets * chunkSize;
+  int limit = (gridSize / block) * block;


The local variables block and limit are declared as int, but they are derived from int64_t parameters (gridSize, numChiplets, chunkSize). This could lead to integer overflow if the values are large (e.g., gridSize > INT_MAX / numChiplets). Consider changing these to int64_t for consistency and to prevent potential overflow issues.

Suggested change

int block = numChiplets * chunkSize;

int limit = (gridSize / block) * block;

int64_t block = numChiplets * chunkSize;

int64_t limit = (gridSize / block) * block;

dhernandez0 · 2026-01-08T09:59:55Z

We can't merge this PR yet because it fails on gfx950:

bin/rocmlir-gen -g 1 -m 32768 -k 32 -n 32769 --transA=false --transB=false --transC=true -t f4E2M1FN -scale_a_dtype f8E8M0FNU -scale_b_dtype f8E8M0FNU --schedule_version  2  --operation gemm -out_dtype f32 --scaledGemm --arch gfx950:sramecc+:xnack- -pv_with_gpu   | bin/rocmlir-driver -c -gO 3 | external/llvm-project/llvm/bin/mlir-runner -O2 --shared-libs=external/llvm-project/llvm/lib/libmlir_rocm_runtime.so,lib/libconv-validation-wrappers.so,external/llvm-project/llvm/lib/libmlir_runner_utils.so,external/llvm-project/llvm/lib/libmlir_float16_utils.so --entry-point-result=void

Changing it to -gO 0 makes the command pass.

dhernandez0 self-assigned this Dec 9, 2025

dhernandez0 requested a review from causten as a code owner December 9, 2025 10:39

dhernandez0 changed the title ~~Change XCD remapping logic~~ [DRAFT] Change XCD remapping logic Dec 9, 2025

dhernandez0 changed the title ~~[DRAFT] Change XCD remapping logic~~ Change XCD remapping logic Dec 10, 2025

dhernandez0 requested review from Copilot, justinrosner, pabloantoniom and umangyadav December 10, 2025 15:50

Copilot started reviewing on behalf of dhernandez0 December 10, 2025 15:52 View session

Copilot AI reviewed Dec 10, 2025

View reviewed changes

justinrosner approved these changes Dec 10, 2025

View reviewed changes

pabloantoniom reviewed Dec 11, 2025

View reviewed changes

Base automatically changed from greedy_third_iteration to develop December 12, 2025 14:05

dhernandez0 force-pushed the xcd_remapping branch from 35de766 to e7acb82 Compare December 12, 2025 14:20

dhernandez0 requested a review from Copilot December 15, 2025 09:55

Copilot started reviewing on behalf of dhernandez0 December 15, 2025 09:57 View session

Copilot AI reviewed Dec 15, 2025

View reviewed changes

Change XCD remapping logic

dff5e38

dhernandez0 force-pushed the xcd_remapping branch from e73c1c2 to dff5e38 Compare January 8, 2026 09:48

		// CHECK-DAG: %[[GROUD_ID:.+]] = arith.divui %[[NEW_BID2]], %c240 : index
		// CHECK-DAG: %[[FIRST_BID_M:.+]] = arith.muli %[[GROUD_ID]], %c6 : index

		int64_t groupSize = std::ceil(std::sqrt(info.numCU / numChiplets)) *
		(bitWidthOut / bitWidthIn);

-  int64_t groupSize = std::ceil(std::sqrt(info.numCU / numChiplets)) *
-                      (bitWidthOut / bitWidthIn);
+  int64_t groupSize = static_cast<int64_t>(
+      std::ceil(std::sqrt(static_cast<double>(info.numCU) / numChiplets) *
+                (static_cast<double>(bitWidthOut) / bitWidthIn)));

		// CHECK-DAG: %[[IS_LARGER_THAN_FULLBLCOK:.+]] = arith.cmpi sgt, %[[BID]], %c576 : index
		// CHECK-DAG: %[[NEW_BID:.+]] = arith.select %[[IS_LARGER_THAN_FULLBLCOK]], %[[BID]], %[[MAYBE_NEW_BID]] : index

		int block = numChiplets * chunkSize;
		int limit = (gridSize / block) * block;

Change XCD remapping logic #2161

Are you sure you want to change the base?

Change XCD remapping logic #2161

Conversation

dhernandez0 commented Dec 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Technical Details

Perf data on MI350

Test Plan

Test Result

Submission Checklist

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Dec 10, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 10, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 10, 2025

Choose a reason for hiding this comment

Uh oh!

justinrosner Dec 10, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 10, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 10, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 10, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 10, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 10, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 10, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 10, 2025

Choose a reason for hiding this comment

Uh oh!

justinrosner left a comment

Choose a reason for hiding this comment

Uh oh!

justinrosner Dec 10, 2025

Choose a reason for hiding this comment

Uh oh!

pabloantoniom left a comment

Choose a reason for hiding this comment

Uh oh!

pabloantoniom Dec 11, 2025

Choose a reason for hiding this comment

Uh oh!

dhernandez0 commented Dec 11, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Dec 15, 2025

Choose a reason for hiding this comment

Uh oh!

dhernandez0 commented Jan 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

dhernandez0 commented Dec 9, 2025 •

edited

Loading