cuda-checkpoint Support for PyTorch with DataLoader Workers #28

@ZeroExistence

Description

Hello team, good day. I wanted to ask whether there are plans to support running cuda-checkpoint against PyTorch workloads that have running DataLoader workers.

As far as I can tell, my concern resembles this issue, but that one does not seem to cover workloads with DataLoader workers, which produce a process tree like the one below (a minimal script reproducing this tree is sketched after it).

tini(32089)───python3(32149)─┬─pt_data_worker(32204)
                             ├─pt_data_worker(32205)
                             ├─pt_data_worker(32206)
                             └─pt_data_worker(32207)
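
For reference, a minimal script that produces this kind of process tree looks roughly like the sketch below (the dummy tensor dataset and tiny model are placeholders, not the actual imagenet/main.py code):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset; any dataset would do here.
dataset = TensorDataset(torch.randn(256, 3, 64, 64),
                        torch.zeros(256, dtype=torch.long))

# num_workers > 0 is what spawns the pt_data_worker child
# processes visible in the pstree output above.
loader = DataLoader(dataset, batch_size=32, num_workers=4)

model = torch.nn.Conv2d(3, 8, kernel_size=3).cuda()
for images, labels in loader:
    out = model(images.cuda())  # the CUDA context lives in the parent process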

Based on my testing, cuda-checkpoint can only toggle the state of the parent python3 process (32149); toggling any of the child worker processes fails.
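
Concretely, what I attempted looks roughly like this (a sketch using the PIDs from the pstree above; calling the cuda-checkpoint binary through subprocess is purely for illustration):

import subprocess

def toggle(pid: int) -> None:
    # cuda-checkpoint --toggle suspends or resumes the CUDA state of
    # the given process; check=True raises on a non-zero exit code.
    subprocess.run(["cuda-checkpoint", "--toggle", "--pid", str(pid)],
                   check=True)

toggle(32149)  # parent python3 process: toggles fine
toggle(32204)  # pt_data_worker child: fails in my testing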

Sample lsof output for the parent and one of the workers:

root@host:~# lsof -p 32149 | grep nvidia
lsof: WARNING: can't stat() fuse.portal file system /run/user/1000/doc
      Output information may be incomplete.
python3 32149 root  mem       REG  252,0  2217912   20186302 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.570.133.20
root@host:~# lsof -p 32204 | grep nvidia
lsof: WARNING: can't stat() fuse.portal file system /run/user/1000/doc
      Output information may be incomplete.
pt_data_w 32204 root  mem       REG   0,581                  16 /dev/nvidia0 (path dev=0,5, inode=856)
pt_data_w 32204 root  mem       REG   0,581                  15 /dev/nvidiactl (path dev=0,5, inode=852)
pt_data_w 32204 root  mem       REG   252,0  2217912   20186302 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.570.133.20
pt_data_w 32204 root    8u      CHR 195,255      0t0         15 /dev/nvidiactl
pt_data_w 32204 root    9u      CHR   234,0      0t0         13 /dev/nvidia-uvm
pt_data_w 32204 root   10u      CHR   234,0      0t0         13 /dev/nvidia-uvm
pt_data_w 32204 root   11u      CHR   195,0      0t0         16 /dev/nvidia0
pt_data_w 32204 root   12u      CHR   195,0      0t0         16 /dev/nvidia0
pt_data_w 32204 root   13u      CHR   195,0      0t0         16 /dev/nvidia0
pt_data_w 32204 root   14u      CHR   195,0      0t0         16 /dev/nvidia0
pt_data_w 32204 root   17u      CHR 195,255      0t0         15 /dev/nvidiactl
pt_data_w 32204 root   18u      CHR   195,0      0t0         16 /dev/nvidia0
pt_data_w 32204 root   19u      CHR   234,0      0t0         13 /dev/nvidia-uvm
pt_data_w 32204 root   20u      CHR   234,0      0t0         13 /dev/nvidia-uvm
pt_data_w 32204 root   21u      CHR   195,0      0t0         16 /dev/nvidia0
pt_data_w 32204 root   22u      CHR   195,0      0t0         16 /dev/nvidia0
pt_data_w 32204 root   24u      CHR   195,0      0t0         16 /dev/nvidia0
pt_data_w 32204 root   25u      CHR   195,0      0t0         16 /dev/nvidia0
pt_data_w 32204 root   26u      CHR   195,0      0t0         16 /dev/nvidia0
pt_data_w 32204 root   27u      CHR   195,0      0t0         16 /dev/nvidia0
pt_data_w 32204 root   28u      CHR   195,0      0t0         16 /dev/nvidia0
pt_data_w 32204 root   29u      CHR   195,0      0t0         16 /dev/nvidia0
pt_data_w 32204 root   31u      CHR   195,0      0t0         16 /dev/nvidia0
pt_data_w 32204 root   32u      CHR   195,0      0t0         16 /dev/nvidia0
pt_data_w 32204 root   33u      CHR   195,0      0t0         16 /dev/nvidia0
pt_data_w 32204 root   34u      CHR   195,0      0t0         16 /dev/nvidia0
pt_data_w 32204 root   35u      CHR   195,0      0t0         16 /dev/nvidia0
pt_data_w 32204 root   36u      CHR   195,0      0t0         16 /dev/nvidia0
pt_data_w 32204 root   37u      CHR   195,0      0t0         16 /dev/nvidia0
pt_data_w 32204 root   38u      CHR   195,0      0t0         16 /dev/nvidia0

I assume this is due to the current limitation of cuda-checkpoint quoted below, given that the workers hold open /dev/nvidia-uvm file descriptors in the lsof output above?

does not support UVM or IPC memory

For a normal Python process without DataLoader workers, checkpoint and restore work perfectly fine.

As a workaround, I am looking for options other than removing the DataLoader workers entirely. Are there any? I assume it is not possible to run the DataLoader separately from the main Python process?
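
One idea I am considering is to tear the workers down between epochs and only invoke cuda-checkpoint externally during those windows. An untested sketch (it assumes persistent_workers=False, the default, so the workers exit once the loader is dropped):

import gc
from torch.utils.data import DataLoader

def train(dataset, model, epochs):
    for epoch in range(epochs):
        # The pt_data_worker processes only exist while this loader
        # (strictly, its iterator) is alive.
        loader = DataLoader(dataset, batch_size=32, num_workers=4)
        for images, labels in loader:
            model(images.cuda())
        # Drop the loader so the worker processes exit; cuda-checkpoint
        # could then be toggled on the parent process alone.
        del loader
        gc.collect()

I have not yet verified whether the parent toggles cleanly once the workers are gone.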

For reference, my environment is:

  • Image: ngc25.03-pytorch
  • Sample code: imagenet/main.py from the pytorch/examples repo, run with --dummy --epochs 500
  • Runtime: containerd v2.1.1 + runc v1.3.0
  • NVIDIA Driver: 570.133.20

Thank you very much for your work!
