Fix FSDP+DP Transformer Engine error #344
Merged
copybara-service[bot] merged 1 commit into AI-Hypercomputer:main on Mar 3, 2026
Conversation
entrpn approved these changes on Mar 3, 2026
Merged commit 68e0696 into AI-Hypercomputer:main (26 of 27 checks passed)
This PR fixes a bug that prevented using FSDP and DP simultaneously. The fix is to not pass the DP parallelism axis to the Transformer Engine mesh resource. I believe this restriction exists for Transformer Engine's MultiHeadAttention class and does not apply to the DotProductAttention class used in this repo. The change has been tested on flux and wan training/inference.
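For illustration, here is a minimal sketch of the idea in JAX. The mesh axis names and device layout are assumptions, and the MeshResource field names should be checked against the installed Transformer Engine version:

```python
import numpy as np
import jax
# MeshResource is Transformer Engine's JAX sharding descriptor.
from transformer_engine.jax.sharding import MeshResource

# Assumes 4 devices arranged as (data=2, fsdp=2); adjust to your topology.
mesh = jax.sharding.Mesh(
    np.asarray(jax.devices()).reshape(2, 2),
    axis_names=("data", "fsdp"),
)

# The fix: leave dp_resource unset so TE does not tie the batch dimension
# to the "data" axis. DotProductAttention (used in this repo) works without
# it; the DP requirement appears to apply only to TE's MultiHeadAttention.
te_mesh = MeshResource(fsdp_resource="fsdp")
```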
This PR also applies the transformer_engine_context to most of the training/generate scripts and cleans up some existing code.
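A hedged sketch of how such a context might wrap a script; the transformer_engine_context helper below is a hypothetical reconstruction for illustration, not the PR's actual implementation:

```python
from contextlib import contextmanager
from transformer_engine.jax.sharding import MeshResource, global_shard_guard

@contextmanager
def transformer_engine_context(mesh):
    # Hypothetical reconstruction: enter TE's sharding context for `mesh`,
    # mapping only the FSDP/TP axes and deliberately skipping DP (see above).
    resource = MeshResource(
        fsdp_resource="fsdp" if "fsdp" in mesh.axis_names else None,
        tp_resource="tensor" if "tensor" in mesh.axis_names else None,
    )
    with mesh, global_shard_guard(resource):
        yield

# Usage in a training/generate script (names illustrative):
# with transformer_engine_context(mesh):
#     state = train_step(state, batch)
```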