[Issue]: NVSHMEM initialization fails on InfiniBand when NVSHMEM_IB_TRAFFIC_CLASS is exported #45

@moon657

Description

How is this issue impacting you?

Application crash

Share Your Debug Logs

15:35:48.149 /opt/nvshmem/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp:1393: non-zero status: 121 Error in mlx5dv_devx_obj_modify for INIT2RTR_QP with syndrome 764216
15:35:48.149 
15:35:48.149 /opt/nvshmem/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp:3005: non-zero status: 7 ibgda_rc_init2rtr failed on RC #0.
15:35:48.149 /opt/nvshmem/nvshmem_src/src/host/transport/transport.cpp:420: non-zero status: 7 connect EPS failed 
15:35:48.149 
15:35:48.149 /opt/nvshmem/nvshmem_src/src/host/init/init.cu:1044: non-zero status: 7 nvshmem setup connections failed 
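The log lines above form a failure chain from the innermost call (the QP transition in ibgda.cpp) out to nvshmem initialization. A minimal sketch for pulling that chain out of a saved log, assuming every error line follows the `<time> <file>:<line>: non-zero status: <n> <message>` shape seen here; the function name `parse_failures` is hypothetical, and the syndrome value is deliberately left as text because the log does not say which base it is printed in.

```python
import re

# Matches lines shaped like the ones in this report:
#   <time> <file>:<line>: non-zero status: <n> <message>
LOG_LINE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<file>\S+):(?P<line>\d+): non-zero status: "
    r"(?P<status>\d+)\s*(?P<message>.*)$"
)


def parse_failures(log_text):
    """Return one dict per error line, in the order the errors were logged."""
    records = []
    for raw in log_text.splitlines():
        match = LOG_LINE.match(raw.strip())
        if match is None:
            continue  # timestamp-only or unrelated lines are skipped
        records.append(
            {
                "file": match.group("file"),
                "line": int(match.group("line")),
                "status": int(match.group("status")),
                "message": match.group("message"),  # kept verbatim, syndrome included
            }
        )
    return records


if __name__ == "__main__":
    import sys

    for rec in parse_failures(sys.stdin.read()):
        print(f'{rec["file"]}:{rec["line"]} status={rec["status"]} {rec["message"]}')
```

Feeding the log above on stdin would list four records, the first pointing at ibgda.cpp:1393 and the last at init.cu:1044.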

Steps to Reproduce the Issue

No response

NVSHMEM Version

3.3.9 + cuda12.4

Your platform details

ibstatus
Infiniband device 'mlx5_0' port 1 status:
default gid: fe80:0000:0000:0000:a088:c203:0002:7710
base lid: 0x7db
sm lid: 0x66
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 400 Gb/sec (4X NDR)
link_layer: InfiniBand

Infiniband device 'mlx5_1' port 1 status:
default gid: fe80:0000:0000:0000:a088:c203:0002:703c
base lid: 0x7de
sm lid: 0x66
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 400 Gb/sec (4X NDR)
link_layer: InfiniBand

Error Message & Behavior

When I set NVSHMEM_IB_TRAFFIC_CLASS on a system whose ports use the InfiniBand link layer (link_layer: InfiniBand), nvshmem initialization fails and the application exits. This seems to be caused by a call into a RoCE-related API, but I am not sure.
If I unset the variable, initialization works.
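As a stopgap until the root cause is known, the workaround can be scripted. A minimal sketch, assuming the link layer can be read from the standard Linux sysfs files under /sys/class/infiniband/<device>/ports/<port>/link_layer and that the job uses those ports; the helper names `read_link_layers` and `sanitized_env` are hypothetical, and this only avoids the failure, it does not fix it.

```python
import os
from pathlib import Path

SYSFS_IB = Path("/sys/class/infiniband")  # standard Linux sysfs location for RDMA devices
TRAFFIC_CLASS_VAR = "NVSHMEM_IB_TRAFFIC_CLASS"


def read_link_layers(root=SYSFS_IB):
    """Return {"mlx5_0/1": "InfiniBand", ...} for every port found under root."""
    layers = {}
    for path in sorted(Path(root).glob("*/ports/*/link_layer")):
        device, port = path.parts[-4], path.parts[-2]
        layers[f"{device}/{port}"] = path.read_text().strip()
    return layers


def sanitized_env(env, link_layers):
    """Copy env, dropping the traffic class variable if any port is InfiniBand."""
    cleaned = dict(env)
    # Assumption: one InfiniBand port is enough to trigger the failure above.
    if any(layer == "InfiniBand" for layer in link_layers.values()):
        cleaned.pop(TRAFFIC_CLASS_VAR, None)
    return cleaned


if __name__ == "__main__":
    env = sanitized_env(os.environ, read_link_layers())
    state = "kept" if TRAFFIC_CLASS_VAR in env else "not set"
    print(f"{TRAFFIC_CLASS_VAR}: {state}")
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when launching the application, so the variable stays available on RoCE hosts and is dropped only on InfiniBand ones.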
