Open
Description
How is this issue impacting you?
Application crash
Share Your Debug Logs
15:35:48.149 /opt/nvshmem/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp:1393: non-zero status: 121 Error in mlx5dv_devx_obj_modify for INIT2RTR_QP with syndrome 764216
15:35:48.149
15:35:48.149 /opt/nvshmem/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp:3005: non-zero status: 7 ibgda_rc_init2rtr failed on RC #0.
15:35:48.149 /opt/nvshmem/nvshmem_src/src/host/transport/transport.cpp:420: non-zero status: 7 connect EPS failed
15:35:48.149
15:35:48.149 /opt/nvshmem/nvshmem_src/src/host/init/init.cu:1044: non-zero status: 7 nvshmem setup connections failed
Steps to Reproduce the Issue
No response
NVSHMEM Version
3.3.9 (CUDA 12.4)
Your platform details
ibstatus
Infiniband device 'mlx5_0' port 1 status:
    default gid: fe80:0000:0000:0000:a088:c203:0002:7710
    base lid: 0x7db
    sm lid: 0x66
    state: 4: ACTIVE
    phys state: 5: LinkUp
    rate: 400 Gb/sec (4X NDR)
    link_layer: InfiniBand
Infiniband device 'mlx5_1' port 1 status:
    default gid: fe80:0000:0000:0000:a088:c203:0002:703c
    base lid: 0x7de
    sm lid: 0x66
    state: 4: ACTIVE
    phys state: 5: LinkUp
    rate: 400 Gb/sec (4X NDR)
    link_layer: InfiniBand
Error Message & Behavior
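When I set NVSHMEM_IB_TRAFFIC_CLASS in an InfiniBand environment (link_layer: InfiniBand), NVSHMEM initialization fails and the application exits. The failure occurs in mlx5dv_devx_obj_modify during the INIT2RTR QP transition (see the debug logs above), seemingly because a RoCE-related API is being called on an InfiniBand link?
If I unset NVSHMEM_IB_TRAFFIC_CLASS, initialization succeeds.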
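Until this is resolved, one possible launcher-side workaround is to drop the variable whenever a port reports an InfiniBand link layer. The sketch below is only an illustration, not part of NVSHMEM: the helper names `link_layer` and `sanitized_env` are hypothetical, and it assumes the Linux RDMA sysfs layout `/sys/class/infiniband/<device>/ports/<port>/link_layer`, which holds the same value that `ibstatus` prints. Treating "any InfiniBand port present" as the condition for dropping the variable is also an assumption based on the behavior described above.

```python
import os
from pathlib import Path


def link_layer(device, port=1, sysfs_root="/sys/class/infiniband"):
    """Return the link layer ("InfiniBand" or "Ethernet") reported for a port."""
    path = Path(sysfs_root) / device / "ports" / str(port) / "link_layer"
    return path.read_text().strip()


def sanitized_env(devices, env=None, sysfs_root="/sys/class/infiniband"):
    """Return a copy of env without NVSHMEM_IB_TRAFFIC_CLASS if any device is InfiniBand.

    Assumption: the traffic class setting is only meaningful on RoCE (Ethernet)
    links, so it is removed as soon as one listed device reports InfiniBand.
    """
    env = dict(os.environ if env is None else env)
    layers = [link_layer(dev, sysfs_root=sysfs_root) for dev in devices]
    if any(layer == "InfiniBand" for layer in layers):
        env.pop("NVSHMEM_IB_TRAFFIC_CLASS", None)
    return env
```

On the platform described here, something like `subprocess.run(cmd, env=sanitized_env(["mlx5_0", "mlx5_1"]))` would start the application with the variable removed, which matches the manual workaround of unsetting it.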