[AMD] Improve Scheduling for Async BF16 GEMM by raikonenfnu · Pull Request #802 · ROCm/triton

raikonenfnu · 2025-05-21T05:53:55Z

Use single AsyncWait (1030 -> 1070)
Move local load before global load to hide latency (1076)
Move slice local load(3) to the cluster before dot(3) (1080.5)
Update clusterBarrier to schedBarrier + s_barrier + schedBarrier (1086)
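
To put the figures in the list above in proportion, here is a minimal sketch that converts them into relative gains. It assumes that 1030 is the baseline, that each later figure is the cumulative result after the corresponding change, and that all figures share one unit (the PR does not state the unit). The step labels and the `relative_gains` helper are illustrative names, not part of the PR.

```python
# Figures quoted in the PR description, in order.
# Assumption: 1030 is the baseline and each later value is cumulative.
# The unit is not stated in the PR, so the values are treated as plain scores.
steps = [
    ("baseline", 1030.0),
    ("single AsyncWait", 1070.0),
    ("local load before global load", 1076.0),
    ("slice local load(3) before dot(3)", 1080.5),
    ("schedBarrier + s_barrier + schedBarrier", 1086.0),
]


def relative_gains(steps):
    """Return (label, % gain over previous step, % gain over baseline)."""
    baseline = steps[0][1]
    rows = []
    for (_, prev), (label, cur) in zip(steps, steps[1:]):
        step_pct = 100.0 * (cur - prev) / prev
        total_pct = 100.0 * (cur - baseline) / baseline
        rows.append((label, step_pct, total_pct))
    return rows


for label, step_pct, total_pct in relative_gains(steps):
    print(f"{label}: +{step_pct:.2f}% over previous, +{total_pct:.2f}% over baseline")
```

Under these assumptions, the single AsyncWait change accounts for most of the improvement, and the remaining three changes each add well under one percent.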

The core Triton team is a small number of people, and we receive many PRs (thank
you!). To help us review your code more quickly, if you are a new
contributor (fewer than 3 PRs merged) we ask that you complete the following
tasks and include the filled-out checklist in your PR description.

Complete the following tasks before sending your PR, and replace [ ] with
[x] to indicate you have done them.

I am not making a trivial change, such as fixing a typo in a comment.
I have written a PR description following these rules.
I have run pre-commit run --from-ref origin/main --to-ref HEAD.
Select one of the following.
- I have added tests.
  - /test for lit tests
  - /unittest for C++ tests
  - /python/test for end-to-end tests
- This PR does not need a test because FILL THIS IN.
Select one of the following.
- I have not added any lit tests.
- The lit tests I have added follow these best practices,
  including the "tests should be minimal" section. (Usually running Python code
  and using the instructions it generates is not minimal.)

The computation part interleaves mfma and ds_read. Placed an extra conditional barrier to overlap the computation part and the buffer_load part. Dot slicing by plognjen at https://github.com/plognjen/triton/tree/slice_dot_scaled requires a vmcnt fix to achieve full performance.

* [AMD] Generalize PingPong to have different type of Load/Store Ops

The main motivation behind this commit is to add support for PingPong with AsyncOps. To accomplish that, we made these changes:

- Fork "determineDotMemoryOps" into "determineDotAsyncMemoryOps", which handles async memory ops.
- Refactor validation and pruning of memory ops into "pruneDotMemoryOps" so that we have a clean interface for its async memory ops counterpart, "pruneAsyncDotMemoryOps".
- Plumb "useBlockPingpong" into StreamPipeliner so that it can adjust the AsyncWait stage/cluster to hoist the first AsyncWait and set the AsyncWait towards the end of the loop. This makes it easier for the 4 PP cluster to move it before the 3rd dot-slice / 2 s_barrier before localLoads, which ensures there are no race conditions.
- Add a check to enable handling of dotSOps (dot scaled) vs. dotOps (dot).

Signed-off-by: Stanley Winata <stanley.winata@amd.com>
Co-authored-by: Alexander Weinrauch <alexander.weinrauch@amd.com>

…ton-lang#6844)

This commit improves how we create the mfma-like layout for optimizing global store by using linear layout composition. Along the way, it fixes a few implementation issues.

Co-authored-by: Yi Qian <yi.qian@amd.com>

- Use single AsyncWait (1030 -> 1070)
- Move local load before global load to hide latency (1076)
- Move slice local load(3) to the cluster before dot(3) (1080.5)
- Update clusterBarrier to schedBarrier + s_barrier + schedBarrier (1086)

Signed-off-by: Stanley Winata <stanley.winata@amd.com>

jungpark-mlir

[WIP]
Nothing wrong with this code, but it requires async_copy slicing to achieve its full performance.
Holding until the related issue is addressed. We'll be working on plan B.

AlexAUT and others added 30 commits May 13, 2025 17:19

[FA] 4-stage FA pipeliner

826bda0

4-stage FA experiment Cluster assignment

[FA] Add FA scripts

c35e297

[FA] Place cvt layout in the same stage and cluster as LocalLoad so canonicalize can fold it

06cf75a

[ASYNC_COPY] Add env var to bypass permute, only works if the load dim is contiguous

203fe11

[FA] Do not combine AsyncWaits to have a barrier in front of each memory ops cluster

b664353

[ASYNC_COPY] Remove MemoryEffect of BufferLoadToLocal to avoid implicit barrier from Membar

012793a

[FA] Compute max before mul QK_SCALE to fold sub into fma

3b74f4a

[FA] Added 2 extra clusters to have async_waits in front of memory clusters

b059372

[FA] Place LocalLoads before AsyncCopies

f3bb293

[FA][ASYNC_COPY] Force vec=8 for shared encodings to avoid 32bit direct to lds loads

77884fa

[FA] Place dots at the top of clusters

b2e2ad0

[FA] Split 4-stage clusters into 8 clusters to better control the order in the loop

fab1281

[FA] Revert order change in SM clusters

e0ea5e7

[FA] Set vecSize=nonKDim for V shared layout to avoid bank conflicts

1198462

I'll submit a PR upstream later.

[FA] Removed old vectorSize workaround

fb186d4

[FA] Revert "Place AsyncWait at the top of schedule"

3212481

This reverts commit 718ee20.

[FA][PINGPONG] Add support for FAv3 pingpong.

34beed7

Initial support over already arranged ops.

[FA][PINGPONG] Allow block pingpong with num_stages==4

3861063

[FA][PINGPONG] Bail out if async wait count != 2

fc6d1d9

[FA] Do not pipeline second loop (causal)

d6a0419

[FA] Split FourStagePipeliner to separate file and do very basic selection based on the loop.

1d1e8cc

This is not meant as a permanent solution, just to make this branch usable for other workloads.

[GEMM] Add combine dot_scaled and addF

d46f750

[GEMM] Do not swizzle the scale

4a5ece6

Add layout conversion pass optim at the end

8285bfc

Fix to the gemm pingpong.

b3c2f94

Fix incorrect condition to choose enable transforms. Fix missing tokens to the local_load

Add restriction to dot_scaled pingpong.

a19dd6d

Only enable for 256x256x256 tilesize

Revert "[AMD] Use v_permlane to optimize MFAM to linear layout on GFX950 (triton-lang#6744)"

f6065b9

This reverts commit 4ecc496.

Revert "[BACKEND] bump to llvm/llvm-project@3c709802d31b (triton-lang#6754)"

d7e2e2c

This reverts commit f3076b1.

Revert because no longer needed: "[ASYNC_COPY] Remove MemoryEffect of BufferLoadToLocal to avoid implicit barrier from Membar"

247f4f4

This reverts commit 012793a.

raikonenfnu and others added 9 commits May 15, 2025 08:24

Add initial support for skinny mxfp gemm

c5c0e67

Overlapping buffer_load and local_load+dot

add AB load separated pingpong for skinny gemm.

bcc871d

Revert "Revert "[AMD] Use v_permlane to optimize MFAM to linear layout on GFX950 (triton-lang#6744)""

33f6ce9

This reverts commit f6065b9.

[ASYNCCOPY] Simplify swizzling calculations to get better codegen from the backend

aebdfd7

Code cleanup

6527f10

avoid wrongly enabled.

Add skinny pingpong transform

1082cd2

Requirement to enable the transform : mxfp4, 128x128x512 tile size, async_copy, num_stages=2, num_warps=8

[FA] Disable pipelining for causal loop

5c4b1fb
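
The "Add skinny pingpong transform" commit (1082cd2) above lists the conditions under which the transform is enabled. A minimal sketch of that gating logic follows. The actual check lives in the compiler backend and is not shown on this page, so `KernelConfig`, its field names, and `skinny_pingpong_enabled` are hypothetical, and reading the 128x128x512 tile as (M, N, K) is an assumption.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class KernelConfig:
    # Hypothetical container; field names mirror the commit message wording.
    dtype: str
    tile: Tuple[int, int, int]  # assumed to be (M, N, K)
    async_copy: bool
    num_stages: int
    num_warps: int


def skinny_pingpong_enabled(cfg: KernelConfig) -> bool:
    """Return True only when every requirement from commit 1082cd2 holds."""
    return bool(
        cfg.dtype == "mxfp4"
        and cfg.tile == (128, 128, 512)
        and cfg.async_copy
        and cfg.num_stages == 2
        and cfg.num_warps == 8
    )


matching = KernelConfig("mxfp4", (128, 128, 512), True, 2, 8)
print(skinny_pingpong_enabled(matching))
```

Keeping the conditions in one predicate makes it easy to see that a mismatch in any single field, for example `num_warps=4`, leaves the transform disabled, which matches the intent of the "avoid wrongly enabled" cleanup commit.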

raikonenfnu requested review from antiagainst and zhanglx13 as code owners May 21, 2025 05:53

raikonenfnu force-pushed the raikonenfnu/BetterAsyncBF16Schedule branch from 02d3120 to 8c3abdc Compare May 21, 2025 06:03

jungpark-mlir requested changes May 21, 2025

antiagainst force-pushed the shared/triton-gfx950-launch branch from 77c00fa to a259f0a Compare May 26, 2025 17:58

raikonenfnu mentioned this pull request May 27, 2025

[AMD] Improve Scheduling for Async BF16 GEMM #812

Merged

7 tasks

raikonenfnu wants to merge 40 commits into shared/triton-gfx950-launch from raikonenfnu/BetterAsyncBF16Schedule

6 participants
