Fix FSDP+DP Transformer Engine error #344

Merged
copybara-service[bot] merged 1 commit into AI-Hypercomputer:main from cpersson-amd:main on Mar 3, 2026
Conversation

@cpersson-amd (Contributor)

This PR fixes a bug that prevented using FSDP and DP simultaneously. The fix is to not pass the DP parallelism axis to the Transformer Engine mesh resource. I believe this restriction exists for Transformer Engine's MultiHeadAttention class and does not apply to the DotProductAttention class used in this repo. The change has been tested on flux and wan training/inference.

This PR also applies the transformer_engine_context to most of the training/generate scripts and cleans up some previous code.
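A minimal sketch of the change described above: build the Transformer Engine mesh resource with only the FSDP (and tensor-parallel) axes, leaving the DP axis unset. The `MeshResource` dataclass, the `make_te_mesh_resource` helper, and the axis names `"fsdp"` and `"tensor"` are all hypothetical stand-ins for illustration, not the actual Transformer Engine API or this repo's code.

```python
# Hypothetical stand-in for Transformer Engine's mesh resource config;
# not the real transformer_engine.jax API.
from dataclasses import dataclass
from typing import Optional

@dataclass
class MeshResource:
    dp_resource: Optional[str] = None    # data-parallel mesh axis
    fsdp_resource: Optional[str] = None  # fully-sharded data-parallel axis
    tp_resource: Optional[str] = None    # tensor-parallel axis

def make_te_mesh_resource(fsdp_axis: str, tp_axis: str) -> MeshResource:
    # Before the fix, dp_resource was also populated, which triggered an
    # error when FSDP and DP were combined. DotProductAttention does not
    # need the DP axis, so it is deliberately left as None here.
    return MeshResource(fsdp_resource=fsdp_axis, tp_resource=tp_axis)

resource = make_te_mesh_resource("fsdp", "tensor")
assert resource.dp_resource is None  # DP axis intentionally omitted
```

The design point is that the DP axis can still exist in the device mesh used by the rest of the training step; it is only the Transformer Engine resource object that omits it.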

copybara-service[bot] merged commit 68e0696 into AI-Hypercomputer:main on Mar 3, 2026
26 of 27 checks passed

2 participants