Add --gpu option to the create command for NVIDIA support
#94
Conversation
@click.option("--no-gpu", is_flag=True, help="Create a notebook without GPU support.")
@click.option(
    "--gpu",
    type=click.Choice(SUPPORTED_GPUS),
Does this need to be case sensitive?
Suggested change:
-     type=click.Choice(SUPPORTED_GPUS),
+     type=click.Choice(SUPPORTED_GPUS, case_sensitive=False),
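For illustration, a minimal sketch of how the case-insensitive choice behaves (SUPPORTED_GPUS is assumed to be ["nvidia"] here, purely for the example):

import click

SUPPORTED_GPUS = ["nvidia"]  # assumed value, for illustration only

@click.command()
@click.option("--gpu", type=click.Choice(SUPPORTED_GPUS, case_sensitive=False))
def create(gpu):
    # With case_sensitive=False, "--gpu NVIDIA" and "--gpu nvidia" both validate,
    # and Click normalizes the value to the declared choice "nvidia".
    click.echo(f"gpu={gpu}")

if __name__ == "__main__":
    create()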
| help=f"Path to a Kubernetes config file. Defaults to the value of the KUBECONFIG environment variable, else to '{KUBECONFIG_DEFAULT}'.", # noqa E501 | ||
| ) | ||
| def create_notebook_command(name: str, image: str, kubeconfig: str) -> None: | ||
| @click.option("--no-gpu", is_flag=True, help="Create a notebook without GPU support.") |
Why do we need a --no-gpu? Can the absence of --gpu=something imply no gpu, and we get rid of this?
It's the UX team's recommendation to provide this --no-gpu flag.
I wonder if there's a miscommunication here. If I understand correctly, these two commands are both the same?
dss create my-notebook
dss create my-notebook --no-gpu
Am I missing something about the CLI?
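A minimal sketch of the behaviour being suggested, where omitting --gpu simply means no GPU and no separate --no-gpu flag is needed (SUPPORTED_GPUS and the command name here are assumptions for illustration, not the PR's code):

import click

SUPPORTED_GPUS = ["nvidia"]  # assumed value, for illustration only

@click.command()
@click.option(
    "--gpu",
    type=click.Choice(SUPPORTED_GPUS, case_sensitive=False),
    default=None,
    help="GPU vendor to attach to the notebook. Omit for a CPU-only notebook.",
)
def create(gpu):
    # gpu is None unless the user opts in explicitly, so no --no-gpu flag is needed.
    if gpu is None:
        click.echo("Creating a CPU-only notebook")
    else:
        click.echo(f"Creating a notebook with {gpu} GPU support")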
{% if gpu %}
resources:
  limits:
    {{ gpu }}: 1
{% endif %}
Do we support only a single GPU per notebook? If I have two GPUs, should we support using both?
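If multiple GPUs per notebook were wanted, one option (a sketch only; the gpu_count variable is an assumption and not part of this PR) would be to template the count as well:

from jinja2 import Template

# Hypothetical fragment extending the PR's template with a configurable count.
MANIFEST_FRAGMENT = """
{% if gpu %}
resources:
  limits:
    {{ gpu }}: {{ gpu_count | default(1) }}
{% endif %}
"""

# "nvidia.com/gpu" is the resource name exposed by the NVIDIA device plugin.
print(Template(MANIFEST_FRAGMENT).render(gpu="nvidia.com/gpu", gpu_count=2))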
self.msg = str(msg)


def node_has_gpu_labels(lightkube_client: Client, labels: List[str]) -> bool:
This asserts that all nodes have the provided labels, not that a given node has GPU labels.
Alternatively, you could include the GPU labels here in the function code and drop the labels input.
Suggested change:
- def node_has_gpu_labels(lightkube_client: Client, labels: List[str]) -> bool:
+ def all_nodes_have_labels(lightkube_client: Client, labels: List[str]) -> bool:
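For reference, a minimal sketch of what the renamed helper could look like with lightkube (not the PR's actual code, just the semantics being suggested):

from typing import List

from lightkube import Client
from lightkube.resources.core_v1 import Node

def all_nodes_have_labels(lightkube_client: Client, labels: List[str]) -> bool:
    # True only if every node in the cluster carries every requested label key.
    nodes = list(lightkube_client.list(Node))
    if not nodes:
        return False
    return all(
        all(label in (node.metadata.labels or {}) for label in labels)
        for node in nodes
    )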
name (str): The name of the notebook server.
image (str): The image used for the notebook server.
lightkube_client (Client): The Kubernetes client.
image (str): The Docker image used for the notebook server.
Maybe leave this as "image" or "OCI image" instead of "Docker", just so we don't exclude rocks.
)
logger.info(f"Success: Notebook {name} created successfully.")
if gpu:
    logger.info(f"{gpu.title()} GPU attached to notebook.")
Minor suggestion: title() won't be correct in all cases (e.g. "Amd"). I'd stick to the enumerated values we say gpu should be.
| logger.info(f"{gpu.title()} GPU attached to notebook.") | |
| logger.info(f"{gpu} GPU attached to notebook.") |
I don't think the gpu requests are working as expected. Try this:
dss initialize --kubeconfig ~/.kube/config
dss create gpu-omitted --image=kubeflownotebookswg/jupyter-pytorch-cuda-full:latest --kubeconfig ~/.kube/config
dss create gpu-selected --image=kubeflownotebookswg/jupyter-pytorch-cuda-full:latest --kubeconfig ~/.kube/config --gpu=nvidia
Then in each notebook server:
- create a terminal and do nvidia-smi. The GPU should be visible only in one notebook server, but it is visible in both
- create a notebook and run import torch; torch.cuda.is_available(). The GPU should be available only in one notebook server, but it is available in both
- create a notebook and run this tutorial. Both will run on the GPU. After doing this, run nvidia-smi on the host system and we'll see two processes both using the GPU, like:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08 Driver Version: 535.161.08 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 37C P0 36W / 70W | 280MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 70671 C /opt/conda/bin/python 158MiB |
| 0 N/A N/A 96171 C /opt/conda/bin/python 118MiB |
+---------------------------------------------------------------------------------------+
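A small check that can be pasted into each notebook to verify the isolation described above (just a verification snippet, it does not change anything):

import torch

# In the notebook created without --gpu this should report False / 0 devices;
# in the notebook created with --gpu=nvidia it should report True / 1 device.
print("cuda available:", torch.cuda.is_available())
print("device count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("device name:", torch.cuda.get_device_name(0))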
One thing that is working correctly is the GPU requests. My machine has one GPU, and if I do:
dss create x ... --gpu=nvidia
dss create y ... --gpu=nvidia
notebook y sits pending with the FailedScheduling warning: 1 Insufficient nvidia.com/gpu
see above comments
Tested using a g4dn.xlarge EC2 instance in AWS. Had to bump the storage to ~150GB to run multiple notebooks. The VM was set up with:
sudo snap install microk8s --channel 1.28/stable --classic
sudo usermod -a -G microk8s ubuntu
mkdir ~/.kube
sudo chown -R ubuntu ~/.kube
newgrp microk8s
# This starts a new terminal
microk8s enable storage dns rbac gpu
microk8s config > ~/.kube/config
git clone http://github.com/canonical/data-science-stack
cd data-science-stack/
git checkout --track origin/KF-5420-add-gpu-flag-for-create
pip install -e .
python --version
nvidia-smi
dss initialize --kubeconfig ~/.kube/config
dss create nothing-specified --image=kubeflownotebookswg/jupyter-pytorch-cuda-full:latest --kubeconfig ~/.kube/config
dss create with-gpu --image=kubeflownotebookswg/jupyter-pytorch-cuda-full:latest --kubeconfig ~/.kube/config --gpu=nvidia
From there, I enabled a SOCKS proxy from my local machine (ssh -D 9999 -C -N -i PEM_FILE EC2_INSTANCE) and made sample notebooks out of this PyTorch example to test whether the GPU was working.
closes: #39
Users can now create GPU-backed notebooks in the cluster.
create also now blocks indefinitely until the notebook is created or the image turns out to be unpullable.
NOTE: in order to test this you need a device with an NVIDIA GPU set up in microk8s. I have tested it on my device.
microk8s setup:
Other minor changes:
- create now waits infinitely until it's finished
- wait_for_deployment can now wait infinitely
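For context, a minimal sketch of what an unbounded deployment wait can look like with lightkube (a sketch only, not the PR's actual wait_for_deployment implementation):

import time

from lightkube import Client
from lightkube.resources.apps_v1 import Deployment

def wait_for_deployment(client: Client, name: str, namespace: str) -> None:
    # Poll until the Deployment reports at least one available replica.
    # There is deliberately no timeout: the loop runs until the notebook
    # comes up (or the caller interrupts, e.g. with Ctrl+C).
    while True:
        dep = client.get(Deployment, name=name, namespace=namespace)
        available = (dep.status.availableReplicas or 0) if dep.status else 0
        if available >= 1:
            return
        time.sleep(5)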