
Conversation

@liukeOoO commented Jan 7, 2026

Summary:
In this diff:

  • We enable all DDA collectives (AllReduce, AllGather, ReduceScatter, AllToAll) by default.
  • To avoid the assert error inside DDA (DDA only supports NRANKS=8), we add a check in AlgoInit.h that disables DDA when comm.nRanks != 8 (see the sketch after this list).
  • Fix the failing test rccl_allreduce_perf_bench, which errored with "invalid device context" (shown in V1):
    • Root cause: "zgpu_benchmark" --> D90052113
    • zgpu_benchmark passes the rccl-tests arg "-g 8", meaning 8 GPUs per thread; DDA's IPC (inter-process communication) path is not compatible with this single-process scheme.
    • So, set "-g 1" in zgpu_benchmark to fix it (see the example invocation below).
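
A minimal C++ sketch of the check described above: AlgoInit.h and comm.nRanks come from this diff's summary, while the helper name, the stub struct, and the constant are assumptions for illustration only.

```cpp
// Minimal stub of the communicator; the real ncclComm has many more fields.
struct ncclComm { int nRanks; };

// DDA kernels currently assert NRANKS == 8, so any other communicator
// size must fall back to the default (non-DDA) algorithms.
constexpr int kDdaSupportedRanks = 8;

// Hypothetical helper mirroring the check added in AlgoInit.h: returning
// false here routes the collective to the baseline implementation instead
// of tripping the assert inside DDA.
static bool ddaAvailable(const ncclComm* comm) {
  return comm->nRanks == kDdaSupportedRanks;
}
```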

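For reference, the two launch schemes differ in how ranks map to GPUs; only the "-g" value is the change made in zgpu_benchmark, and the binary name, size flags, and launcher below are illustrative:

```sh
# Old scheme: a single process/thread drives all 8 GPUs (-g 8).
# DDA's IPC path expects one process per GPU, so this fails with
# "invalid device context".
./all_reduce_perf -b 8 -e 128M -f 2 -g 8

# Fixed scheme: one GPU per rank (-g 1), with the 8 ranks launched
# externally (the launcher shown is only an example).
mpirun -np 8 ./all_reduce_perf -b 8 -e 128M -f 2 -g 1
```
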
Mast jobs: https://fburl.com/network/o2c0v5ov

ToDo: make DDA support the nRanks < 8 case?

Differential Revision: D89249175

meta-cla bot added the CLA Signed label Jan 7, 2026
meta-codesync bot commented Jan 7, 2026

@liukeOoO has exported this pull request. If you are a Meta employee, you can view the originating Diff in D89249175.

…assert error (meta-pytorch#131)

Reviewed By: dmwu

Differential Revision: D89249175