[AMD] Improve Scheduling for Async BF16 GEMM #802

Open
raikonenfnu wants to merge 40 commits into shared/triton-gfx950-launch from raikonenfnu/BetterAsyncBF16Schedule

@raikonenfnu
Member

  • Use single AsyncWait (1030 -> 1070)
  • Move local load before global load to hide latency (1076)
  • Move slice local load(3) to the cluster before dot(3) (1080.5)
  • Update clusterBarrier to schedBarrier + s_barrier + schedBarrier (1086)

The core Triton team is a small number of people, and we receive many PRs (thank
you!). To help us review your code more quickly, if you are a new
contributor (fewer than 3 PRs merged), we ask that you complete the following
tasks and include the filled-out checklist in your PR description.

Complete the following tasks before sending your PR, and replace [ ] with
[x] to indicate you have done them.

  • I am not making a trivial change, such as fixing a typo in a comment.

  • I have written a PR description following these
    rules.

  • I have run pre-commit run --from-ref origin/main --to-ref HEAD.

  • Select one of the following.

    • I have added tests.
      • /test for lit tests
      • /unittest for C++ tests
      • /python/test for end-to-end tests
    • This PR does not need a test because FILL THIS IN.
  • Select one of the following.

    • I have not added any lit tests.
    • The lit tests I have added follow these best practices,
      including the "tests should be minimal" section. (Usually running Python code
      and using the instructions it generates is not minimal.)

AlexAUT and others added 30 commits May 13, 2025 17:19
4-stage FA experiment

Cluster assignment
Initial support over already arranged ops.
…ction based on the loop. This is not meant as a permanent solution, just a way to make this branch usable for other workloads
Computation part interleaves mfma and ds_read
Placed extra conditional barrier to overlap computation part
and buffer_load part. Dot slicing by plognjen at https://github.com/plognjen/triton/tree/slice_dot_scaled
requires vmcnt fix to achieve full performance.
Fix incorrect condition for choosing whether to enable transforms.
Fix missing tokens to the local_load
Only enable for 256x256x256 tile size
… BufferLoadToLocal to avoid implicit barrier from Membar"

This reverts commit 012793a.
raikonenfnu and others added 9 commits May 15, 2025 08:24
* [AMD] Generalize PingPong to have different types of Load/Store Ops

The main motivation behind this commit is to add support for PingPong
with AsyncOps. To accomplish that, we made these changes:
- Fork "determineDotMemoryOps" to "determineDotAsyncMemoryOps", which handles async memory ops.
- Refactor validation and pruning of memory ops into "pruneDotMemoryOps"
  so that we have a clean interface for its async memory ops counterpart,
  "pruneAsyncDotMemoryOps".
- Plumb "useBlockPingpong" into StreamPipeliner so that it can adjust the
  AsyncWait stage/cluster: hoist the first AsyncWait, and set the AsyncWait
  towards the end of the loop, making it easier for the 4 PP cluster to move
  it before the 3rd dot-slice / 2 s_barrier before localLoads.
  This is to ensure no race conditions.
- Add a check to enable handling of dotSOps (dot scaled) vs. dotOps (dot)

Signed-off-by: Stanley Winata <stanley.winata@amd.com>
Co-authored-by: Alexander Weinrauch <alexander.weinrauch@amd.com>
Overlapping buffer_load and local_load+dot
…ton-lang#6844)

This commit improves how we create the mfma-like layout for
optimizing global store by using linear layout composition.
Along the way, it fixes a few implementation issues.

---------

Co-authored-by: Yi Qian <yi.qian@amd.com>
avoid wrongly enabled.
Requirements to enable the transform: mxfp4, 128x128x512 tile size, async_copy, num_stages=2, num_warps=8
- Use single AsyncWait (1030 -> 1070)
- Move local load before global load to hide latency (1076)
- Move slice local load(3) to the cluster before dot(3)  (1080.5)
- Update clusterBarrier to schedBarrier + s_barrier + schedBarrier
  (1086)

Signed-off-by: Stanley Winata <stanley.winata@amd.com>
@raikonenfnu raikonenfnu force-pushed the raikonenfnu/BetterAsyncBF16Schedule branch from 02d3120 to 8c3abdc Compare May 21, 2025 06:03
@jungpark-mlir jungpark-mlir left a comment

[WIP]
There is nothing wrong with this code, but it requires async_copy slicing to achieve its full performance.
Holding until the related issue is addressed. We'll be working on plan B.

@antiagainst antiagainst force-pushed the shared/triton-gfx950-launch branch from 77c00fa to a259f0a Compare May 26, 2025 17:58
6 participants