Description
The CUDA 13 release notes state that "cuda-checkpoint utility updated to allow GPU migration. Users can now specify how GPUs from the old and new machines should be matched by specifying UUID pairs."
I'm trying to use the new '--device-map' option to restore onto a different GPU (same model, L4) on the same system, but the restore keeps failing with 'invalid argument'. If I map the old UUID to itself, the restore succeeds. The two commands, the GPU list, and the tool's help output are below. Any ideas? Thanks!
[root@localhost ~]# ./cuda-checkpoint/bin/x86_64_Linux/cuda-checkpoint --action restore --pid 1100100 -d GPU-eea0d83f-ebcd-ac27-6b3e-bec2a19724e6=GPU-8f4022ba-e0cb-1298-286e-225c953a7feb
Could not restore on process ID 1100100: "invalid argument"
[root@localhost ~]# ./cuda-checkpoint/bin/x86_64_Linux/cuda-checkpoint --action restore --pid 1100100 -d GPU-eea0d83f-ebcd-ac27-6b3e-bec2a19724e6=GPU-eea0d83f-ebcd-ac27-6b3e-bec2a19724e6
[root@localhost ~]#
[root@localhost ~]# nvidia-smi -L
GPU 0: NVIDIA L4 (UUID: GPU-eea0d83f-ebcd-ac27-6b3e-bec2a19724e6)
GPU 1: NVIDIA L4 (UUID: GPU-8f4022ba-e0cb-1298-286e-225c953a7feb)
GPU 2: NVIDIA L4 (UUID: GPU-5449fdd5-fb83-f4ed-37a3-ea1057d82160)
GPU 3: NVIDIA L4 (UUID: GPU-6fbeda44-9d0c-67a9-1dd4-44a22fb5ff2c)
[root@localhost ~]# cuda-checkpoint --help
CUDA checkpoint and restore utility.
Version 580.65.06. Copyright (C) 2025 NVIDIA Corporation. All rights reserved.
Operations:
--get-state --pid <pid>
Prints the current checkpoint state of the process specified by <pid>
--action lock | checkpoint | restore | unlock --pid <pid> [--timeout <timeout>] [--device-map <device-map>]
Performs the specified action on <pid>.
For the lock action a timeout can be provided, the lock operation will wait up to <timeout> milliseconds for the operation to succeed.
For the restore action a device map can be provided in the format oldUuid=newUuid,oldUuid=newUuid,... which will be used to remap old devices to new devices.
--toggle --pid <pid>
Toggles the CUDA state in the specified process between the running and checkpointed states
--get-restore-tid --pid <pid>
Retrieves the CUDA restore thread ID of the process specified by <pid>
Options:
--pid|-p <pid>
The pid upon which to perform the operation
--timeout|-t <timeout>
Optional timeout that can be specified for the lock action in milliseconds
--device-map|-d <device-map>
Optional device map used during the restore action to manually remap old devices onto new ones.
<device-map> is a comma delimited list in the format oldUuid1=newUuid1,oldUuid2=newUuid2,...
Must contain all checkpointed devices if specified.
--help|-h
Print this help message
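
For anyone reproducing this, here is a minimal sketch of building the device-map string from `nvidia-smi -L` output and checking it against the two rules the help text states (comma-delimited `oldUuid=newUuid` pairs, and every checkpointed device must appear). The helper names `parse_gpu_uuids` and `build_device_map` are hypothetical, the UUID pattern is an assumption based on the listing above, and the check that each target UUID exists on the machine is my own sanity check rather than something the help text documents. It only validates the shape of the argument; it does not explain the 'invalid argument' error.

```python
import re

# Assumed UUID shape, taken from the `nvidia-smi -L` listing in this issue.
UUID_RE = re.compile(r"GPU-[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}")


def parse_gpu_uuids(smi_listing: str) -> list[str]:
    """Return the GPU UUIDs found in `nvidia-smi -L` output, in listing order."""
    return UUID_RE.findall(smi_listing)


def build_device_map(pairs: dict[str, str],
                     checkpointed: list[str],
                     available: list[str]) -> str:
    """Build an oldUuid=newUuid,... string for --device-map (hypothetical helper)."""
    # Help text: "Must contain all checkpointed devices if specified."
    missing = [old for old in checkpointed if old not in pairs]
    if missing:
        raise ValueError(f"device map is missing checkpointed devices: {missing}")
    # Extra sanity check (an assumption, not a documented rule):
    # every target UUID should be a GPU that is present on this machine.
    unknown = [new for new in pairs.values() if new not in available]
    if unknown:
        raise ValueError(f"target devices not present on this system: {unknown}")
    return ",".join(f"{old}={new}" for old, new in pairs.items())


LISTING = """\
GPU 0: NVIDIA L4 (UUID: GPU-eea0d83f-ebcd-ac27-6b3e-bec2a19724e6)
GPU 1: NVIDIA L4 (UUID: GPU-8f4022ba-e0cb-1298-286e-225c953a7feb)
GPU 2: NVIDIA L4 (UUID: GPU-5449fdd5-fb83-f4ed-37a3-ea1057d82160)
GPU 3: NVIDIA L4 (UUID: GPU-6fbeda44-9d0c-67a9-1dd4-44a22fb5ff2c)
"""

gpus = parse_gpu_uuids(LISTING)
# Assumes the process was checkpointed on GPU 0 only and should move to GPU 1.
device_map = build_device_map({gpus[0]: gpus[1]},
                              checkpointed=[gpus[0]],
                              available=gpus)
print(device_map)
```

With the listing from this issue, the resulting string is the same argument passed to `-d` in the failing command, so the argument itself appears to match the documented format.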