Description
Hello team, good day. I wanted to check whether there is, or will be, support for running cuda-checkpoint against a PyTorch process that has running DataLoader workers.
As far as I can tell, my concern is similar to this issue, but that one does not seem to cover workloads with DataLoader workers like the one below.
tini(32089)───python3(32149)─┬─pt_data_worker(32204)
├─pt_data_worker(32205)
├─pt_data_worker(32206)
└─pt_data_worker(32207)
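For context, those pt_data_worker children are created by a DataLoader with num_workers > 0. A minimal sketch of that kind of setup (hypothetical dummy dataset, not the actual imagenet/main.py code):

import torch
from torch.utils.data import DataLoader, Dataset

class DummyDataset(Dataset):
    # Stand-in for the --dummy ImageNet data used in pytorch/examples.
    def __len__(self):
        return 1000

    def __getitem__(self, idx):
        return torch.randn(3, 224, 224), idx % 1000

# num_workers=4 gives the four pt_data_worker children shown in the tree above
# (forked when iteration starts).
loader = DataLoader(DummyDataset(), batch_size=32, num_workers=4, pin_memory=True)

model = torch.nn.Conv2d(3, 8, 3).cuda()
for images, labels in loader:
    images = images.cuda(non_blocking=True)  # GPU work stays in the parent process
    out = model(images)
    break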
Based on my testing, cuda-checkpoint can only toggle the state of the parent python3 process (32149); toggling any of the child worker processes fails.
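Concretely, the toggle attempts look like the sketch below (PIDs taken from the tree above, using cuda-checkpoint's --toggle/--pid invocation):

import subprocess

def toggle(pid: int) -> None:
    # Ask the cuda-checkpoint binary to toggle the CUDA state of one process.
    result = subprocess.run(
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
        capture_output=True, text=True,
    )
    print(f"pid={pid} rc={result.returncode} {result.stderr.strip()}")

toggle(32149)                                    # parent python3 process: toggles fine
for worker_pid in (32204, 32205, 32206, 32207):  # pt_data_worker children: these fail for me
    toggle(worker_pid)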
Sample lsof output for the processes:
root@host:~# lsof -p 32149 | grep nvidia
lsof: WARNING: can't stat() fuse.portal file system /run/user/1000/doc
Output information may be incomplete.
python3 32149 root mem REG 252,0 2217912 20186302 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.570.133.20
root@host:~# lsof -p 32204 | grep nvidia
lsof: WARNING: can't stat() fuse.portal file system /run/user/1000/doc
Output information may be incomplete.
pt_data_w 32204 root mem REG 0,581 16 /dev/nvidia0 (path dev=0,5, inode=856)
pt_data_w 32204 root mem REG 0,581 15 /dev/nvidiactl (path dev=0,5, inode=852)
pt_data_w 32204 root mem REG 252,0 2217912 20186302 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.570.133.20
pt_data_w 32204 root 8u CHR 195,255 0t0 15 /dev/nvidiactl
pt_data_w 32204 root 9u CHR 234,0 0t0 13 /dev/nvidia-uvm
pt_data_w 32204 root 10u CHR 234,0 0t0 13 /dev/nvidia-uvm
pt_data_w 32204 root 11u CHR 195,0 0t0 16 /dev/nvidia0
pt_data_w 32204 root 12u CHR 195,0 0t0 16 /dev/nvidia0
pt_data_w 32204 root 13u CHR 195,0 0t0 16 /dev/nvidia0
pt_data_w 32204 root 14u CHR 195,0 0t0 16 /dev/nvidia0
pt_data_w 32204 root 17u CHR 195,255 0t0 15 /dev/nvidiactl
pt_data_w 32204 root 18u CHR 195,0 0t0 16 /dev/nvidia0
pt_data_w 32204 root 19u CHR 234,0 0t0 13 /dev/nvidia-uvm
pt_data_w 32204 root 20u CHR 234,0 0t0 13 /dev/nvidia-uvm
pt_data_w 32204 root 21u CHR 195,0 0t0 16 /dev/nvidia0
pt_data_w 32204 root 22u CHR 195,0 0t0 16 /dev/nvidia0
pt_data_w 32204 root 24u CHR 195,0 0t0 16 /dev/nvidia0
pt_data_w 32204 root 25u CHR 195,0 0t0 16 /dev/nvidia0
pt_data_w 32204 root 26u CHR 195,0 0t0 16 /dev/nvidia0
pt_data_w 32204 root 27u CHR 195,0 0t0 16 /dev/nvidia0
pt_data_w 32204 root 28u CHR 195,0 0t0 16 /dev/nvidia0
pt_data_w 32204 root 29u CHR 195,0 0t0 16 /dev/nvidia0
pt_data_w 32204 root 31u CHR 195,0 0t0 16 /dev/nvidia0
pt_data_w 32204 root 32u CHR 195,0 0t0 16 /dev/nvidia0
pt_data_w 32204 root 33u CHR 195,0 0t0 16 /dev/nvidia0
pt_data_w 32204 root 34u CHR 195,0 0t0 16 /dev/nvidia0
pt_data_w 32204 root 35u CHR 195,0 0t0 16 /dev/nvidia0
pt_data_w 32204 root 36u CHR 195,0 0t0 16 /dev/nvidia0
pt_data_w 32204 root 37u CHR 195,0 0t0 16 /dev/nvidia0
pt_data_w 32204 root 38u CHR 195,0 0t0 16 /dev/nvidia0
I assume this is due to the current cuda-checkpoint limitation quoted below, given that the workers hold /dev/nvidia-uvm open?
"does not support UVM or IPC memory"
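A quick way to confirm which processes in the tree actually hold UVM handles (plain /proc inspection, matching the lsof output above):

import os

def holds_uvm(pid: int) -> bool:
    # True if the process has any open file descriptor pointing at /dev/nvidia-uvm.
    fd_dir = f"/proc/{pid}/fd"
    try:
        for fd in os.listdir(fd_dir):
            try:
                if os.readlink(os.path.join(fd_dir, fd)).startswith("/dev/nvidia-uvm"):
                    return True
            except OSError:
                continue
    except OSError:
        pass
    return False

for pid in (32149, 32204, 32205, 32206, 32207):
    print(pid, holds_uvm(pid))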
For a normal python process without DataLoader workers, checkpoint and restore work perfectly fine.
As a possible workaround, I am looking for options other than removing the DataLoader workers entirely. Are there any other options? Is it even possible to run the DataLoader separately from the main python process?
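One direction I have been considering, shown as a rough sketch below (I have not verified that it actually keeps the NVIDIA device files out of the workers): start the workers with the "spawn" context and keep the Dataset strictly CPU-only, so that only the parent process ever holds a CUDA context.

import torch
from torch.utils.data import DataLoader, Dataset

class CpuOnlyDataset(Dataset):
    # Hypothetical dataset; __getitem__ must never touch CUDA so that
    # spawned workers never open /dev/nvidia* or /dev/nvidia-uvm.
    def __len__(self):
        return 1000

    def __getitem__(self, idx):
        return torch.randn(3, 224, 224), idx % 1000

loader = DataLoader(
    CpuOnlyDataset(),
    batch_size=32,
    num_workers=4,
    pin_memory=True,                    # pinning runs in the main process
    multiprocessing_context="spawn",    # workers start clean instead of inheriting forked state
    persistent_workers=True,
)

for images, labels in loader:
    images = images.cuda(non_blocking=True)   # GPU transfers only in the parent process
    labels = labels.cuda(non_blocking=True)
    # ... training step ...
    break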
For reference, I am using each below:
- Image: ngc25.03-pytorch
- Sample code: pytorch/examples repo, imagenet/main.py --dummy --epochs 500
- Runtime: Containerd v2.1.1 + runc v1.3.0
- NVIDIA Driver: 570.133.20
Thank you very much for your work!