cuda-checkpoint hangs during container checkpoint with k3s/containerd #2775

Description

I'm attempting to checkpoint a PyTorch container that uses an NVIDIA GPU with k3s ctr c checkpoint, but the process fails.

This issue seems specific to checkpointing within the container runtime, as using CRIU on a host process works correctly.

I run the following command:

sudo k3s ctr c checkpoint --task --rw <CONTAINER_ID> ckpt

but the checkpoint hangs and eventually fails with a timeout error:

(00.005028) Preparing image inventory (version 1)
(00.005051) Add pid ns 1 pid 2701895
(00.005059) Add net ns 2 pid 2701895
(00.005067) Add ipc ns 3 pid 2701895
(00.005074) Add uts ns 4 pid 2701895
(00.005082) Add time ns 5 pid 2701895
(00.005096) Add mnt ns 6 pid 2701895
(00.005105) Add user ns 7 pid 2701895
(00.005112) Add cgroup ns 8 pid 2701895
(00.005115) cg: Dumping cgroups for thread 2701895
(00.005132) cg:  `- New css ID 1
(00.005136) cg:     `- [] -> [/system.slice/k3s.service] [0]
(00.005138) cg: Set 1 is criu one
(00.005145) plugin: `cuda_plugin' hook 10 -> 0x71200b6dd6f3
(800.00523) Error (criu/cr-dump.c:1791): Timeout reached. Try to interrupt: 0
(800.00533) Error (cuda_plugin.c:139): cuda_plugin: Unable to read output of cuda-checkpoint: Interrupted system call
(800.34507) Error (cuda_plugin.c:253): cuda_plugin: Failed to launch cuda-checkpoint to retrieve state:
(800.34511) Error (cuda_plugin.c:428): cuda_plugin: Failed to get CUDA state for PID 2680881
(800.34517) net: Unlock network
(800.34519) cuda_plugin: finished cuda_plugin stage 0 err -1
(800.34533) Unfreezing tasks into 1
(800.34534)     Unseizing 2679665 into 1
(800.34535) Error (compel/src/lib/infect.c:418): Unable to detach from 2679665: No such process
(800.34538) Error (criu/cr-dump.c:2111): Dumping FAILED.

While the checkpoint command is running, ps shows that the cuda-checkpoint process is stalled:

$ ps aux | grep cuda-checkpoint
root     2701902  0.0  0.0 33955200 8704 ?       Sl   01:22   0:00 cuda-checkpoint --get-state --pid 2680881

I also ran this cuda-checkpoint command manually, and it stalled as well (a sketch of the manual check is below).
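For reference, this is a minimal sketch of the manual reproduction; the PID comes from the log above, and the extra /proc reads are just my own debugging steps, not part of the plugin's flow:

# Run the same state query that CRIU's cuda_plugin launches; it hangs here too.
sudo cuda-checkpoint --get-state --pid 2680881

# While it hangs, check what the target CUDA process is blocked on.
sudo cat /proc/2680881/wchan; echo
grep -E '^(Name|State)' /proc/2680881/status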

I found a similar issue, NVIDIA/cuda-checkpoint#26, and set timeout 800 in /etc/criu/runc.conf, but I still got the same errors (the config I used is shown below).
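For completeness, the entire content of that config file is the single option below; the path is the config file runc hands to CRIU, and the value is exactly what I set:

# /etc/criu/runc.conf
timeout 800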

When I use CRIU directly on the host instead of inside the container, the workload can be checkpointed and restored normally; a rough sketch of the host-side commands follows.
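This is roughly the host-side flow that works, assuming the CUDA plugin is installed in CRIU's plugin directory; the image directory, verbosity level, and --shell-job flag are my own choices for a standalone test, not anything specific to the GPU workload:

# Checkpoint the PyTorch process running directly on the host.
sudo criu dump -t <PID> -D ./ckpt-images --shell-job -v4

# Restore it from the same image directory.
sudo criu restore -D ./ckpt-images --shell-job -v4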

Env

NVIDIA Driver Version: 570.86.10
CUDA Version: 12.8
CRIU Version: 4.1.1
