cuda-checkpoint hangs during container checkpoint with k3s/containerd #2775

Description

I'm attempting to checkpoint a PyTorch container that uses an NVIDIA GPU with k3s ctr c checkpoint, but the process fails.

This issue seems specific to checkpointing within the container runtime, as using CRIU on a host process works correctly.

I run the following command:

sudo k3s ctr c checkpoint --task --rw <CONTAINER_ID> ckpt

but the checkpoint hangs and eventually fails with a timeout error:

(00.005028) Preparing image inventory (version 1)
(00.005051) Add pid ns 1 pid 2701895
(00.005059) Add net ns 2 pid 2701895
(00.005067) Add ipc ns 3 pid 2701895
(00.005074) Add uts ns 4 pid 2701895
(00.005082) Add time ns 5 pid 2701895
(00.005096) Add mnt ns 6 pid 2701895
(00.005105) Add user ns 7 pid 2701895
(00.005112) Add cgroup ns 8 pid 2701895
(00.005115) cg: Dumping cgroups for thread 2701895
(00.005132) cg:  `- New css ID 1
(00.005136) cg:     `- [] -> [/system.slice/k3s.service] [0]
(00.005138) cg: Set 1 is criu one
(00.005145) plugin: `cuda_plugin' hook 10 -> 0x71200b6dd6f3
(800.00523) Error (criu/cr-dump.c:1791): Timeout reached. Try to interrupt: 0
(800.00533) Error (cuda_plugin.c:139): cuda_plugin: Unable to read output of cuda-checkpoint: Interrupted system call
(800.34507) Error (cuda_plugin.c:253): cuda_plugin: Failed to launch cuda-checkpoint to retrieve state:
(800.34511) Error (cuda_plugin.c:428): cuda_plugin: Failed to get CUDA state for PID 2680881
(800.34517) net: Unlock network
(800.34519) cuda_plugin: finished cuda_plugin stage 0 err -1
(800.34533) Unfreezing tasks into 1
(800.34534)     Unseizing 2679665 into 1
(800.34535) Error (compel/src/lib/infect.c:418): Unable to detach from 2679665: No such process
(800.34538) Error (criu/cr-dump.c:2111): Dumping FAILED.

While the checkpoint command is running, ps shows that the cuda-checkpoint process is stalled:

$ ps aux | grep cuda-checkpoint
root     2701902  0.0  0.0 33955200 8704 ?       Sl   01:22   0:00 cuda-checkpoint --get-state --pid 2680881

I also ran this cuda-checkpoint command manually, and it stalled as well (a sketch of the manual check is below).
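For reference, this is a minimal sketch of the manual reproduction; the PID comes from the log above, and the extra /proc reads are just my own debugging steps, not part of the plugin's flow:

# Run the same state query that CRIU's cuda_plugin launches; it hangs here too.
sudo cuda-checkpoint --get-state --pid 2680881

# While it hangs, check what the target CUDA process is blocked on.
sudo cat /proc/2680881/wchan; echo
grep -E '^(Name|State)' /proc/2680881/status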

I found a similar issue, NVIDIA/cuda-checkpoint#26, and set timeout 800 in /etc/criu/runc.conf, but I still got the same errors (the config I used is shown below).
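For completeness, the entire content of that config file is the single option below; the path is the config file runc hands to CRIU, and the value is exactly what I set:

# /etc/criu/runc.conf
timeout 800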

When I use CRIU directly on the host instead of inside the container, the workload can be checkpointed and restored normally; a rough sketch of the host-side commands follows.
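This is roughly the host-side flow that works, assuming the CUDA plugin is installed in CRIU's plugin directory; the image directory, verbosity level, and --shell-job flag are my own choices for a standalone test, not anything specific to the GPU workload:

# Checkpoint the PyTorch process running directly on the host.
sudo criu dump -t <PID> -D ./ckpt-images --shell-job -v4

# Restore it from the same image directory.
sudo criu restore -D ./ckpt-images --shell-job -v4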

Env

NVIDIA Driver Version: 570.86.10
CUDA Version: 12.8
CRIU Version: 4.1.1
