Separate gpu jobs #18

Closed
homework36 wants to merge 26 commits into main from separate-gpu-jobs

Conversation

@homework36

@homework36 homework36 commented Jul 30, 2025

I don't think the current folder structure in this PR is 100% correct, and it does not match docker-compose either:

dockerfile: ./celery/gpu-workers/Dockerfile

(This will be an ongoing PR for further work.)

I will make a separate clean branch later once everything... I just need to use CI to test the image build with some PR.
This is getting too messy and I don't think I will merge this one at all.

@homework36
Author

homework36 commented Aug 5, 2025

Current problem:

celery-gpu-workers:
  build:
    context: .
    dockerfile: ./celery/gpu-workers/Dockerfile

docker-compose still assumes that all GPU jobs share a single container built from one image, but we are now using several. We need to rewrite this block. The easiest way is to repeat it four times, but that would be redundant and cumbersome.
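
One way to avoid copying the block four times might be YAML anchors combined with Compose extension fields (top-level keys prefixed with x-). This is only a sketch: the service names, the per-job Dockerfile paths, and the shared restart setting are placeholders, not the real layout of this repo.

```yaml
# Sketch only: service names, Dockerfile paths, and shared settings are hypothetical.
x-gpu-build: &gpu-build
  context: .

x-gpu-worker: &gpu-worker
  restart: unless-stopped   # placeholder for whatever the GPU workers have in common

services:
  gpu-job-a:
    <<: *gpu-worker
    build:
      <<: *gpu-build
      dockerfile: ./celery/gpu-workers/job-a/Dockerfile
  gpu-job-b:
    <<: *gpu-worker
    build:
      <<: *gpu-build
      dockerfile: ./celery/gpu-workers/job-b/Dockerfile
```

YAML merge keys are shallow, so the build mapping gets its own anchor and is merged inside each service's build key. Overriding build on a service that only merges the outer anchor would replace the whole mapping, including context.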

New blocker:

We no longer have the line COPY ./rodan-main/code /code/Rodan (from https://github.com/DDMAL/Rodan/blob/develop/gpu-celery/Dockerfile#L157), so anything that relies on it must be fixed.

For example, background removal has RUN pip3 install -r /code/Rodan/requirements.txt, which depends on that path.

@homework36
Author

#65 ERROR: failed to register layer: write /root/.cache/pip/http/4/e/e/3/1/4ee3138b60ee0c1b5c7c6d39d2830adba9f08a2f0a1a300dfb7d80b0: no space left on device

The CI job is running out of disk space!
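
A cleanup step before the image build might help here. A rough sketch, assuming a GitHub-hosted Ubuntu runner where the preinstalled toolchain directories below exist and are not needed by the build (both are assumptions, as is the step name):

```yaml
# Sketch of a workflow step; it would go under jobs.<job_id>.steps, before the docker build.
- name: Free disk space (hypothetical step)
  run: |
    echo "Disk usage before cleanup:"
    df -h /
    # Assumption: these preinstalled toolchains are unused by this build.
    sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/lib/android
    # Remove stopped containers, unused networks, unused images, and build cache.
    docker system prune -af
    echo "Disk usage after cleanup:"
    df -h /
```

Printing df before and after makes it easy to see whether the step actually frees anything on the root filesystem.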

@homework36
Author

#167 [backend-django  8/11] RUN /opt/install_gpu_rodan_jobs
#167 0.114 + which pip3
#167 0.115 + PIP=/usr/local/bin/pip3
#167 0.115 + cd /code/Rodan/code/jobs
#167 0.115 /opt/install_gpu_rodan_jobs: 8: cd: can't cd to /code/Rodan/code/jobs
#167 ERROR: process "/bin/sh -c /opt/install_gpu_rodan_jobs" did not complete successfully: exit code: 2
------
 > [backend-django  8/11] RUN /opt/install_gpu_rodan_jobs:
0.114 + which pip3
0.115 + PIP=/usr/local/bin/pip3
0.115 + cd /code/Rodan/code/jobs
0.115 /opt/install_gpu_rodan_jobs: 8: cd: can't cd to /code/Rodan/code/jobs
------
Dockerfile.bak:45

Do we still need to install gpu jobs on backend-django?

@notkaramel
Collaborator

Do we still need to install gpu jobs on backend-django?

I still haven't figured out how to build backend-django, so you can ignore it.
I'm just surprised that the backend CI is running.

@homework36
Author

Not sure if the cleanup helps free up space on the CI runner:

=== Disk usage before cleanup ===
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        72G   46G   27G  64% /
tmpfs           7.9G   84K  7.9G   1% /dev/shm
tmpfs           3.2G  1.1M  3.2G   1% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/sdb16      881M   60M  760M   8% /boot
/dev/sdb15      105M  6.2M   99M   6% /boot/efi
/dev/sda1        74G  4.1G   66G   6% /mnt
tmpfs           1.6G   12K  1.6G   1% /run/user/1001
Total reclaimed space: 0B
=== Disk usage after cleanup ===
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        72G   36G   36G  50% /
tmpfs           7.9G   84K  7.9G   1% /dev/shm
tmpfs           3.2G  1.1M  3.2G   1% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/sdb16      881M   60M  760M   8% /boot
/dev/sdb15      105M  6.2M   99M   6% /boot/efi
/dev/sda1        74G  4.1G   66G   6% /mnt
tmpfs           1.6G   12K  1.6G   1% /run/user/1001

@homework36 homework36 marked this pull request as draft August 12, 2025 17:15
@homework36 homework36 closed this Aug 12, 2025
@notkaramel notkaramel deleted the separate-gpu-jobs branch August 13, 2025 15:21