
Conversation

@liukeOoO commented Jan 7, 2026

Summary:
In this diff:

  • We enable all DDA collectives (AllReduce, AllGather, ReduceScatter, AllToAll) by default.
  • To avoid the assert error inside DDA (DDA only supports NRANKS=8), we add a check in AlgoInit.h that disables DDA when comm.nRanks != 8 (see the sketch after this list).
  • Fix the failing test rccl_allreduce_perf_bench, which errored with "invalid device context" (shown in V1):
    • Root cause: "zgpu_benchmark" --> D90052113
    • zgpu_benchmark passes the rccl-tests arg "-g 8", meaning 8 GPUs per thread; DDA's IPC (inter-process communication) path is not compatible with this single-process scheme.
    • So, set "-g 1" in zgpu_benchmark to fix it (see the example invocation below).
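
A minimal C++ sketch of the check described above: AlgoInit.h and comm.nRanks come from this diff's summary, while the helper name, the stub struct, and the constant are assumptions for illustration only.

```cpp
// Minimal stub of the communicator; the real ncclComm has many more fields.
struct ncclComm { int nRanks; };

// DDA kernels currently assert NRANKS == 8, so any other communicator
// size must fall back to the default (non-DDA) algorithms.
constexpr int kDdaSupportedRanks = 8;

// Hypothetical helper mirroring the check added in AlgoInit.h: returning
// false here routes the collective to the baseline implementation instead
// of tripping the assert inside DDA.
static bool ddaAvailable(const ncclComm* comm) {
  return comm->nRanks == kDdaSupportedRanks;
}
```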

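For reference, the two launch schemes differ in how ranks map to GPUs; only the "-g" value is the change made in zgpu_benchmark, and the binary name, size flags, and launcher below are illustrative:

```sh
# Old scheme: a single process/thread drives all 8 GPUs (-g 8).
# DDA's IPC path expects one process per GPU, so this fails with
# "invalid device context".
./all_reduce_perf -b 8 -e 128M -f 2 -g 8

# Fixed scheme: one GPU per rank (-g 1), with the 8 ranks launched
# externally (the launcher shown is only an example).
mpirun -np 8 ./all_reduce_perf -b 8 -e 128M -f 2 -g 1
```
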
Mast jobs: https://fburl.com/network/o2c0v5ov

ToDo: make DDA support the nRanks < 8 case?

Differential Revision: D89249175

meta-cla bot added the CLA Signed label Jan 7, 2026
meta-codesync bot commented Jan 7, 2026

@liukeOoO has exported this pull request. If you are a Meta employee, you can view the originating Diff in D89249175.

…assert error (meta-pytorch#131)

Reviewed By: dmwu

Differential Revision: D89249175