feat: add uccl-ep to docker built and add USING_UEP flag to run_pretrain and runner hook #540
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -536,5 +536,9 @@ def validate_args_on_rocm(args): | |
| assert ( | ||
| args.moe_router_dtype == "fp32" | ||
| ), "DeepEP only supports float32 probs, please set `moe_router_dtype=fp32`" | ||
| if args.expert_model_parallel_size >= 16: | ||
| if ( | ||
| args.expert_model_parallel_size >= 16 | ||
| and os.getenv("PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND", "DEEP_EP") == "TURBO" | ||
| ): | ||
| # Turbo DeepEP is not supported for CUs > 32 when using internode dispatch/combine. | ||
| assert args.turbo_deepep_num_cu <= 32, "Set `turbo_deepep_num_cu<=32` when using ep_size >= 16." | ||
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -21,11 +21,9 @@ fi | |||||||||||||||||||||||||
| UCCL_DIR="/tmp/uccl" | ||||||||||||||||||||||||||
| UCCL_BUILD_DIR="${UCCL_BUILD_DIR:-/tmp/uccl_${HOSTNAME:-$(hostname)}}" | ||||||||||||||||||||||||||
| UCCL_REF="${UCCL_REF:-}" | ||||||||||||||||||||||||||
| GPU_ARCHS="${GPU_ARCHS:-gfx942;gfx950}" | ||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||
| LOG_INFO_RANK0 "[hook system] REBUILD_UCCL=1 → Building uccl in /tmp " | ||||||||||||||||||||||||||
| LOG_INFO_RANK0 " Build directory : ${UCCL_BUILD_DIR}" | ||||||||||||||||||||||||||
| LOG_INFO_RANK0 " GPU_ARCHS : ${GPU_ARCHS}" | ||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||
| if [ -d "$UCCL_DIR" ]; then | ||||||||||||||||||||||||||
| LOG_INFO_RANK0 "[hook system] Found existed uccl in /tmp, remove it" | ||||||||||||||||||||||||||
|
|
@@ -47,7 +45,7 @@ if [[ -n "$UCCL_REF" ]]; then | |||||||||||||||||||||||||
| fi | ||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||
| LOG_INFO_RANK0 "[hook system] Building uccl ep" | ||||||||||||||||||||||||||
|
||||||||||||||||||||||||||
| LOG_INFO_RANK0 "[hook system] Building uccl ep" | |
| # Ensure deterministic ROCm arch selection for extension build | |
| if [[ -z "${PYTORCH_ROCM_ARCH:-}" ]]; then | |
| if [[ -n "${GPU_ARCHS:-}" ]]; then | |
| export PYTORCH_ROCM_ARCH="${GPU_ARCHS}" | |
| else | |
| # Fallback to a sane default ROCm arch if none is provided | |
| export PYTORCH_ROCM_ARCH="gfx90a" | |
| fi | |
| fi | |
| LOG_INFO_RANK0 "[hook system] Using PYTORCH_ROCM_ARCH='${PYTORCH_ROCM_ARCH}' for uccl ep build" |
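
The precedence in the suggested change above can be sketched as a small function: an explicit `PYTORCH_ROCM_ARCH` wins, then `GPU_ARCHS`, then the `gfx90a` fallback. The helper name `resolve_rocm_arch` and its two positional arguments are hypothetical stand-ins for the environment variables; this is an illustration of the ordering, not code from the PR.

```shell
#!/bin/bash
# Hypothetical helper (not part of the PR): resolve the ROCm arch list with
# the same precedence as the suggested change.
resolve_rocm_arch() {
    local explicit="$1"    # stands in for PYTORCH_ROCM_ARCH
    local gpu_archs="$2"   # stands in for GPU_ARCHS
    if [[ -n "$explicit" ]]; then
        echo "$explicit"       # an explicit setting is never overridden
    elif [[ -n "$gpu_archs" ]]; then
        echo "$gpu_archs"      # otherwise reuse GPU_ARCHS
    else
        echo "gfx90a"          # fallback used in the suggestion
    fi
}

resolve_rocm_arch "" "gfx942;gfx950"        # prints gfx942;gfx950
resolve_rocm_arch "gfx950" "gfx942;gfx950"  # prints gfx950
resolve_rocm_arch "" ""                     # prints gfx90a
```

Passing the values as arguments keeps the sketch free of side effects; the suggested change itself exports `PYTORCH_ROCM_ARCH` directly.
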
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,62 @@ | ||
| #!/bin/bash | ||
| ############################################################################### | ||
| # Copyright (c) 2025, Advanced Micro Devices, Inc. All rights reserved. | ||
| # | ||
| # See LICENSE for license information. | ||
| ############################################################################### | ||
| # | ||
| # System hook: enable using UEP settings. | ||
| # | ||
| # Trigger: | ||
| # export USING_UEP=1 | ||
| # | ||
| ############################################################################### | ||
|
|
||
|
|
||
| if [ "$USING_UEP" == "1" ]; then | ||
| LOG_INFO "USING_UEP is enabled, checking required packages..." | ||
|
|
||
| if ! python3 -m pip show uccl &>/dev/null || ! python3 -m pip show deep_ep &>/dev/null; then | ||
| LOG_ERROR "uccl is not installed! Please use pre-installed primus image or set REBUILD_UCCL=1." | ||
| exit 1 | ||
| fi | ||
| LOG_INFO "uccl package is installed: $(python3 -m pip show uccl | grep Version)" | ||
| LOG_INFO "deep_ep package is installed: $(python3 -m pip show deep_ep | grep Version)" | ||
|
|
||
| if [ "$ENABLE_NUMA_BINDING" != "1" ]; then | ||
| LOG_WARN "ENABLE_NUMA_BINDING is not enabled! Please set ENABLE_NUMA_BINDING=1 to avoid dataloader worker exited unexpectedly." | ||
| fi | ||
|
|
||
| export PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND=DEEP_EP | ||
| LOG_INFO "PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND set to DEEP_EP" | ||
|
|
||
| # network settings for UCCL | ||
| export UCCL_IB_GID_INDEX=${UCCL_IB_GID_INDEX:-$NCCL_IB_GID_INDEX} | ||
| export UCCL_IB_HCA=${UCCL_IB_HCA:-$NCCL_IB_HCA} | ||
| export UCCL_SOCKET_IFNAME=${UCCL_SOCKET_IFNAME:-$NCCL_SOCKET_IFNAME} | ||
|
|
||
| # set low latency and normal inflight and bytes to avoid hang on AMD Pollara AI NIC and Broadcom Thor-2 | ||
| if [ "$USING_AINIC" == "1" ]; then | ||
| export UCCL_IB_MAX_INFLIGHT_NORMAL=${UCCL_IB_MAX_INFLIGHT_NORMAL:-1} | ||
| export UCCL_IB_MAX_INFLIGHT_LOW_LATENCY=${UCCL_IB_MAX_INFLIGHT_LOW_LATENCY:-1} | ||
| export UCCL_IB_MAX_INFLIGHT_BYTES=${UCCL_IB_MAX_INFLIGHT_BYTES:-4194304} # 4MB | ||
| elif [ "$REBUILD_BNXT" == "1" ]; then # Broadcom Thor-2 | ||
| # FIXME(zhuang12): use `USING_BNXT` for Broadcom Thor-2 maybe better than `REBUILD_BNXT` | ||
| export UCCL_IB_MAX_INFLIGHT_NORMAL=${UCCL_IB_MAX_INFLIGHT_NORMAL:-1} | ||
| export UCCL_IB_MAX_INFLIGHT_LOW_LATENCY=${UCCL_IB_MAX_INFLIGHT_LOW_LATENCY:-1} | ||
| export UCCL_IB_MAX_INFLIGHT_BYTES=${UCCL_IB_MAX_INFLIGHT_BYTES:-1572864} | ||
| fi | ||
|
|
||
|
|
||
| LOG_INFO "==========UCCL Network Settings==========" | ||
| LOG_INFO "UCCL_IB_GID_INDEX: $UCCL_IB_GID_INDEX" | ||
| LOG_INFO "UCCL_IB_HCA: $UCCL_IB_HCA" | ||
| LOG_INFO "UCCL_SOCKET_IFNAME: $UCCL_SOCKET_IFNAME" | ||
| LOG_INFO "UCCL_IB_MAX_INFLIGHT_NORMAL: $UCCL_IB_MAX_INFLIGHT_NORMAL" | ||
| LOG_INFO "UCCL_IB_MAX_INFLIGHT_LOW_LATENCY: $UCCL_IB_MAX_INFLIGHT_LOW_LATENCY" | ||
| LOG_INFO "UCCL_IB_MAX_INFLIGHT_BYTES: $UCCL_IB_MAX_INFLIGHT_BYTES" | ||
| LOG_INFO "" | ||
| else | ||
| export PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND=TURBO | ||
| LOG_INFO "USING_UEP is disabled. PRIMUS_TURBO_MOE_DISPATCH_COMBINE_BACKEND set to TURBO" | ||
| fi |
The rebuild path clones UCCL from `main` without pinning to a commit/tag, which makes runs non-reproducible and can break unexpectedly over time. Consider supporting a `UCCL_COMMIT`/`UCCL_REF` env var (similar to CI) and checking out that ref when provided.
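
One way the pinning described in this comment could look is sketched below, in dry-run form so it prints the git commands instead of touching the network. The repository URL is a placeholder, and the `run` and `clone_uccl` helpers and the `DRY_RUN` switch are hypothetical names introduced for illustration; only `UCCL_REF` and `/tmp/uccl` appear in the diff.

```shell
#!/bin/bash
# Sketch only: the URL is a placeholder, not the address used by the PR.
UCCL_REPO_URL="https://example.invalid/uccl.git"
UCCL_DIR="/tmp/uccl"
DRY_RUN=1   # hypothetical switch: print commands instead of running them

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [[ "${DRY_RUN:-0}" == "1" ]]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Clone the repository, then check out a ref only when one was supplied.
clone_uccl() {
    local ref="$1"   # stands in for UCCL_REF; may be empty
    run git clone "$UCCL_REPO_URL" "$UCCL_DIR" || return 1
    if [[ -n "$ref" ]]; then
        run git -C "$UCCL_DIR" checkout "$ref" || return 1
    fi
}

clone_uccl "abc1234"   # hypothetical commit: prints clone and checkout commands
clone_uccl ""          # no ref: prints only the clone command
```

With an empty ref the sketch stays on the default branch, which matches the unpinned behaviour the comment warns about, so a reproducible run would always pass a commit or tag.
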