RuntimeError: Peer-to-peer device memory access is not supported after upgrading driver 555 → 570 on Azure NC96ads_A100_v4 (4 × NVIDIA A100) #191

@geghamv

Description

After upgrading the NVIDIA driver from 555 to 570 on an Azure Standard NC96ads_A100_v4 Linux VM (4 × NVIDIA A100), my cuQuantum-based workload (qsimcirq) fails at startup with the following error:

RuntimeError: Peer-to-peer device memory access is not supported

This regression appears only after moving to driver 570; the same workload ran fine under 555.

Repro details:

Environment: Ubuntu VM on Azure

VM SKU: Standard NC96ads_A100_v4 (4 × NVIDIA A100)

Driver: upgraded from 555 → 570

cuQuantum: cuQuantum Appliance 25.06 (FROM nvcr.io/nvidia/cuquantum-appliance:25.06-x86_64)

Library: qsimcirq

Trace snippet:

RuntimeError: Peer-to-peer device memory access is not supported
  File "qsimcirq/qsim_simulator.py", line 262, in __init__
    qsim_mgpu.qsim_initialize_devices(gpu_mode)

Topology info:

nvidia-smi topo -m

        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV12    SYS     SYS     0-23            0               N/A
GPU1    NV12     X      SYS     SYS     24-47           1               N/A
GPU2    SYS     SYS      X      NV12    48-71           2               N/A
GPU3    SYS     SYS     NV12     X      72-95           3               N/A

nvidia-smi topo -p2p w

        GPU0    GPU1    GPU2    GPU3
GPU0    X       OK      NS      NS
GPU1    OK      X       NS      NS
GPU2    NS      NS      X       OK
GPU3    NS      NS      OK      X
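
To make the failing pairs explicit, here is a minimal Python sketch that parses the `nvidia-smi topo -p2p w` matrix above and lists every GPU pair that does not report OK in both directions. The helper names `parse_p2p` and `unsupported_pairs` are hypothetical, and the matrix is pasted in as a string rather than read from the tool.

```python
from typing import Dict, List, Tuple

# Matrix copied from the `nvidia-smi topo -p2p w` output in this issue.
P2P_WRITE = """\
      GPU0 GPU1 GPU2 GPU3
GPU0  X    OK   NS   NS
GPU1  OK   X    NS   NS
GPU2  NS   NS   X    OK
GPU3  NS   NS   OK   X
"""


def parse_p2p(text: str) -> Dict[str, Dict[str, str]]:
    """Turn the whitespace-separated matrix into {row_gpu: {col_gpu: status}}."""
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    return {row[0]: dict(zip(header, row[1:])) for row in rows}


def unsupported_pairs(matrix: Dict[str, Dict[str, str]]) -> List[Tuple[str, str]]:
    """Return GPU pairs where P2P is not OK in at least one direction."""
    names = sorted(matrix)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if matrix[a][b] != "OK" or matrix[b][a] != "OK"
    ]


if __name__ == "__main__":
    for a, b in unsupported_pairs(parse_p2p(P2P_WRITE)):
        print(f"no P2P write access between {a} and {b}")
```

For this VM, the pairs listed are exactly the ones that cross between the two NVLink-connected groups.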

Question:
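Is there a way to work around this issue without downgrading the NVIDIA driver? With driver 570, P2P over PCIe appears to be no longer enabled between GPU pairs that have no NVLink connection: in the output above, only the NVLink-connected pairs (GPU0/GPU1 and GPU2/GPU3, shown as NV12) report OK, while every cross pair (shown as SYS) reports NS. As a result, cuQuantum (via qsimcirq) fails to initialize, because it appears to assume P2P support across all devices.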
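
One possible stopgap, untested here, is to restrict the process to a single group of GPUs that do have P2P access to each other by setting `CUDA_VISIBLE_DEVICES` before any CUDA library is loaded. The sketch below groups the GPUs from the OK pairs reported above. The function names `p2p_islands` and `restrict_to_island` are hypothetical, and whether qsimcirq in the cuQuantum Appliance initializes correctly with only two visible GPUs is an assumption, not something confirmed in this issue.

```python
import os
from typing import Dict, Iterable, List, Tuple

# GPU index pairs reported as OK by `nvidia-smi topo -p2p w` in this issue.
OK_PAIRS = [(0, 1), (2, 3)]


def p2p_islands(n_gpus: int, ok_pairs: Iterable[Tuple[int, int]]) -> List[List[int]]:
    """Group GPUs into connected components of the P2P-OK graph (union-find).

    Assumes P2P is all-to-all inside each component, which holds for the
    two-GPU groups on this VM but not for arbitrary topologies.
    """
    parent = list(range(n_gpus))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in ok_pairs:
        parent[find(a)] = find(b)

    groups: Dict[int, List[int]] = {}
    for gpu in range(n_gpus):
        groups.setdefault(find(gpu), []).append(gpu)
    return sorted(groups.values())


def restrict_to_island(island: List[int]) -> str:
    """Expose only the given GPUs; call this before importing any CUDA library."""
    value = ",".join(str(i) for i in island)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


if __name__ == "__main__":
    islands = p2p_islands(4, OK_PAIRS)
    print("P2P islands:", islands)
    print("CUDA_VISIBLE_DEVICES =", restrict_to_island(islands[0]))
```

This halves the GPU memory available to a single simulation, so it is a way to keep running on driver 570 rather than a fix for the missing PCIe P2P support.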
