
Conversation

@misohu (Member) commented Apr 24, 2024

closes: #39

Users can now create GPU-backed notebooks in the cluster.
create also blocks indefinitely until the notebook is created or the image turns out to be unpullable.

NOTE: to test this you need a device with an NVIDIA GPU set up in microk8s. I have tested it on my device.

microk8s setup:

sudo snap install microk8s --channel=1.28/stable --classic
microk8s enable storage dns rbac gpu 
microk8s config > ~/.kube/config
dss initialize 
dss create test-nb --image kubeflownotebookswg/jupyter-scipy:v1.8.0 --kubeconfig ~/.kube/config --gpu=nvidia
dss list

Other minor changes:

  • create now waits indefinitely until it is finished
  • wait_for_deployment can now wait indefinitely (a rough sketch follows)
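
A rough sketch of what the "wait indefinitely" behaviour can look like; the function name, the lightkube calls, and the 1-second poll interval are illustrative, not necessarily the dss implementation:

import time

from lightkube import Client
from lightkube.resources.apps_v1 import Deployment


# Illustrative only: poll the Deployment until all replicas are available,
# with no upper bound on the wait.
def wait_for_deployment_ready(client: Client, name: str, namespace: str) -> None:
    while True:
        deployment = client.get(Deployment, name=name, namespace=namespace)
        status = deployment.status
        replicas = (deployment.spec.replicas or 1) if deployment.spec else 1
        if status is not None and (status.availableReplicas or 0) >= replicas:
            return
        time.sleep(1)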

@misohu misohu requested a review from a team as a code owner April 24, 2024 08:35
@click.option("--no-gpu", is_flag=True, help="Create a notebook without GPU support.")
@click.option(
"--gpu",
type=click.Choice(SUPPORTED_GPUS),

Contributor

Does this need to be case sensitive?

Suggested change
type=click.Choice(SUPPORTED_GPUS),
type=click.Choice(SUPPORTED_GPUS, case_sensitive=False),
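
Quick illustration of what the suggested change enables (hypothetical standalone command, not the dss CLI): with case_sensitive=False, --gpu=NVIDIA is accepted rather than rejected.

import click

@click.command()
@click.option("--gpu", type=click.Choice(["nvidia"], case_sensitive=False))
def demo(gpu: str) -> None:
    # Click matches the value case-insensitively, so "demo --gpu=NVIDIA" works.
    click.echo(gpu)

if __name__ == "__main__":
    demo()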

help=f"Path to a Kubernetes config file. Defaults to the value of the KUBECONFIG environment variable, else to '{KUBECONFIG_DEFAULT}'.", # noqa E501
)
def create_notebook_command(name: str, image: str, kubeconfig: str) -> None:
@click.option("--no-gpu", is_flag=True, help="Create a notebook without GPU support.")

Contributor

Why do we need --no-gpu? Couldn't the absence of --gpu=something imply no GPU, so we can get rid of this?

Member Author

It's a recommendation from the UX team to provide this --no-gpu flag.

Contributor

I wonder if there's a miscommunication here. If I understand correctly, these two commands are the same?

dss create my-notebook
dss create my-notebook --no-gpu

Am I missing something about the CLI?
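
A hypothetical sketch of the shape being discussed (not the dss code): with these two options, omitting --gpu and passing --no-gpu behave identically unless the command adds extra validation, which is the point being raised above.

import click

@click.command()
@click.option("--gpu", type=click.Choice(["nvidia"], case_sensitive=False), default=None)
@click.option("--no-gpu", is_flag=True, help="Create a notebook without GPU support.")
def create(gpu: str, no_gpu: bool) -> None:
    # Guard against contradictory flags; without extra validation, --no-gpu is
    # effectively a no-op whenever --gpu is omitted.
    if gpu and no_gpu:
        raise click.UsageError("--gpu and --no-gpu are mutually exclusive.")
    click.echo(f"gpu={gpu or 'none'}")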

Comment on lines +26 to +30
{% if gpu %}
resources:
limits:
{{ gpu }}: 1
{% endif %}

Contributor

Do we support only a single GPU per notebook? If I have two GPUs, should we support using both?

self.msg = str(msg)


def node_has_gpu_labels(lightkube_client: Client, labels: List[str]) -> bool:

Contributor

This asserts that all nodes have the provided labels, not that a given node has GPU labels.

Alternatively, you could include the GPU labels in the function body and drop the labels input.

Suggested change
def node_has_gpu_labels(lightkube_client: Client, labels: List[str]) -> bool:
def all_nodes_have_labels(lightkube_client: Client, labels: List[str]) -> bool:
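
For contrast, a rough sketch of the "some node has these labels" semantics that the original name suggests, assuming lightkube's list API; the function name is illustrative, not the dss implementation:

from typing import List

from lightkube import Client
from lightkube.resources.core_v1 import Node


# Illustrative only: return True if at least one node carries all of the
# given labels, which is usually the question "is there a GPU node?" asks.
def any_node_has_labels(lightkube_client: Client, labels: List[str]) -> bool:
    for node in lightkube_client.list(Node):
        node_labels = (node.metadata.labels or {}) if node.metadata else {}
        if all(label in node_labels for label in labels):
            return True
    return False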

name (str): The name of the notebook server.
image (str): The image used for the notebook server.
lightkube_client (Client): The Kubernetes client.
image (str): The Docker image used for the notebook server.

Contributor

Maybe leave this as "image" or "OCI image" instead of "Docker", just to not exclude rocks.

)
logger.info(f"Success: Notebook {name} created successfully.")
if gpu:
logger.info(f"{gpu.title()} GPU attached to notebook.")

Contributor

Minor suggestion: title() won't be correct in all cases (e.g. "Amd"). I'd stick to the enumerated values we say gpu should be.

Suggested change
logger.info(f"{gpu.title()} GPU attached to notebook.")
logger.info(f"{gpu} GPU attached to notebook.")

Contributor

I don't think the GPU requests are working as expected. Try this:

dss initialize --kubeconfig ~/.kube/config 
dss create gpu-omitted --image=kubeflownotebookswg/jupyter-pytorch-cuda-full:latest --kubeconfig ~/.kube/config
dss create gpu-selected --image=kubeflownotebookswg/jupyter-pytorch-cuda-full:latest --kubeconfig ~/.kube/config --gpu=nvidia

Then in each notebook server:

  • create a terminal and run nvidia-smi. The GPU should be visible only in one notebook server, but it is visible in both.
  • create a notebook and run import torch; torch.cuda.is_available() (see the snippet after the nvidia-smi output below). The GPU should be available only in one notebook server, but it is available in both.
  • create a notebook and run this tutorial. Both will run on the GPU. Afterwards, run nvidia-smi on the host system and you'll see two processes both using the GPU, like:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       On  | 00000000:00:1E.0 Off |                    0 |
| N/A   37C    P0              36W /  70W |    280MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     70671      C   /opt/conda/bin/python                       158MiB |
|    0   N/A  N/A     96171      C   /opt/conda/bin/python                       118MiB |
+---------------------------------------------------------------------------------------+
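
For reference, the check from the second bullet spelled out; with correct isolation, the notebook created without --gpu should print False and 0 (illustrative snippet, not part of the PR):

import torch

print(torch.cuda.is_available())
print(torch.cuda.device_count())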

Contributor

One thing that is working correctly is the GPU requests. My machine has one GPU, and if I do:

dss create x ... --gpu=nvidia
dss create y ... --gpu=nvidia

notebook y sits pending with the FailedScheduling warning: 1 Insufficient nvidia.com/gpu
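
For completeness, one way to surface that warning programmatically with lightkube; the "dss" namespace is an assumption about where the notebook Pods land:

from lightkube import Client
from lightkube.resources.core_v1 import Event

client = Client()
# List FailedScheduling events for the pending notebook (namespace assumed).
for event in client.list(Event, namespace="dss"):
    if event.reason == "FailedScheduling":
        print(event.involvedObject.name, event.message)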

@ca-scribner (Contributor) left a comment

see above comments

Tested using a g4dn.xlarge EC2 instance in AWS. Had to bump the storage to ~150 GB to run multiple notebooks. The VM was set up with:

sudo snap install microk8s --channel 1.28/stable --classic
sudo usermod -a -G microk8s ubuntu
mkdir ~/.kube
sudo chown -R ubuntu ~/.kube
newgrp microk8s

# This starts a new terminal

microk8s enable storage dns rbac gpu 
microk8s config > ~/.kube/config
git clone http://github.com/canonical/data-science-stack
cd data-science-stack/
git checkout --track origin/KF-5420-add-gpu-flag-for-create
pip install -e .
python --version
nvidia-smi 
dss initialize --kubeconfig ~/.kube/config 
dss create nothing-specified --image=kubeflownotebookswg/jupyter-pytorch-cuda-full:latest --kubeconfig ~/.kube/config
dss create with-gpu --image=kubeflownotebookswg/jupyter-pytorch-cuda-full:latest --kubeconfig ~/.kube/config --gpu=nvidia

From there, I enabled a SOCKS proxy from my local machine (ssh -D 9999 -C -N -i PEM_FILE EC2_INSTANCE) and made sample notebooks out of this pytorch example to test whether the GPU was working; a minimal stand-in for that check is sketched below.
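
A minimal stand-in for such a GPU smoke test (not the linked pytorch example itself; purely illustrative):

import torch

# Run a matmul on the GPU if one is visible and confirm where it executed.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(1024, 1024, device=device)
y = x @ x
print(y.device)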
