@manman-ren
Contributor

@manman-ren manman-ren commented Oct 30, 2025

Summary: Copied from Hongtao's TLX implementation in third_party/tlx/tutorials/blackwell-fa-ws-pipelined-persistent_test.py

Test Plan:
python run.py --op blackwell_attentions --seq-len 8192 --batch 4 --n-heads 32 --d-head 128 --only triton_tutorial_flash_v2_persistent_blackwell --bwd --force --metrics tflops

Reviewers:

Subscribers:

Tasks:

Tags:

@meta-cla meta-cla bot added the cla signed label Oct 30, 2025
@manman-ren manman-ren changed the title from "[Blackwell] add non-causal bwd/FA" to "[Blackwell] add non-causal bwd/FA with TMA and atomic_add" Oct 30, 2025
@meta-codesync

meta-codesync bot commented Oct 30, 2025

@manman-ren has imported this pull request. If you are a Meta employee, you can view this in D85880773.

Contributor

@htyu htyu left a comment

LGTM.

Contributor

@njriasan njriasan left a comment

LGTM!

for blk_idx in range(num_steps):
    q = desc_q.load([(off_bh + curr_m).to(tl.int32), 0])
    qT = tl.trans(q)
    # Load m before computing qk to reduce pipeline stall.
Contributor

Is this still relevant/required with WS?

Contributor Author

It depends on whether the 1D load of m is in the load partition or not. We can load m first, then wait for qk.
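For readers following the thread, here is a rough, dependency-free sketch of the math that one iteration of this loop feeds into, so it is clear where m is consumed relative to qk. It is not the kernel from this PR: `matmul`, `transpose`, and `dkdv_step` are hypothetical helpers, `delta` stands for the row sums of dO * O, and the sketch assumes natural exp, no softmax scale, and no causal mask.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def dkdv_step(k, v, q, do, m, delta):
    """One inner-loop step of a non-causal dK/dV pass (illustrative only).

    k, v:   N x D key/value block.
    q, do:  M x D query block and matching dO block.
    m:      length-M log-sum-exp of each row of q @ k^T over all keys.
    delta:  length-M row sums of dO * O.
    """
    # m is read before the qk product is formed, mirroring the ordering
    # that the comment in the diff asks for.
    m_row = list(m)
    qkT = matmul(k, transpose(q))  # N x M
    # Transposed softmax probabilities: pT[n][j] = exp(qkT[n][j] - m[j]).
    pT = [[math.exp(s - m_row[j]) for j, s in enumerate(row)] for row in qkT]
    dv = matmul(pT, do)  # N x D
    dpT = matmul(v, transpose(do))  # N x M
    dsT = [
        [p * (dp - delta[j]) for j, (p, dp) in enumerate(zip(p_row, dp_row))]
        for p_row, dp_row in zip(pT, dpT)
    ]
    dk = matmul(dsT, q)  # N x D
    return dv, dk
```

The only point of the sketch is the data dependency: m is needed as soon as qk is available, so issuing its load early leaves nothing to wait on once the qk product completes.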

@xuzhao9
Contributor

xuzhao9 commented Oct 30, 2025

Can you help run `ufmt format .` to fix the linting error?

@manman-ren manman-ren merged commit f9168e4 into meta-pytorch:main Nov 7, 2025
8 of 9 checks passed

4 participants