
Deep Learning


One of the many compute applications to which Nautilus is well suited is training and testing deep learning models. This page describes the steps necessary to train and evaluate deep learning models on the Nautilus cluster.

Step 0: Prerequisites

You will need a persistent volume with a capacity of at least 50 GB, and likely more to hold checkpoints and training/testing data.
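
As a rough sketch, a claim along the following lines would satisfy this prerequisite. The claim name (dl-data), storage class, and size are placeholders to adapt to your namespace; check the Nautilus storage documentation for the storage classes currently offered.

```yaml
# Sketch of a PersistentVolumeClaim for checkpoints and training/testing data.
# The name, storageClassName, and requested size are placeholders.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dl-data                     # hypothetical claim name, reused in the later sketches
spec:
  storageClassName: rook-cephfs     # assumption: substitute a storage class offered on Nautilus
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi                 # increase if your data and checkpoints need more space
```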

Step 1: Creating the Container

You will need to create a container with all necessary packages for your preferred deep learning framework, including the scripts used to perform training and testing.

You will most likely need to use a Custom Container. There are several Dockerfiles available for frameworks like Detectron2, MMDetection, PyTorch, and MMSegmentation in the docker directory of this repo.

Step 2: Running a Training Job

Once you have pushed your container to a public-facing registry, such as Docker Hub or Nautilus' GitLab, you can use a standard Kube YAML Spec to train your model, with a few key differences (illustrated in the sketch after this list):

  1. Ensure you have set your workingDir to a path on the persistent volume so that you can access the trained model after training has completed
  2. Set the requests and limits for NVIDIA GPUs to the same number; 1, 2, or 4 is recommended
  3. If you are using distributed training, open the port you need for DDP
  4. Create a RAM-backed shared memory volume and attach it to your container so that PyTorch can use multiple workers and accelerate training

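To make these differences concrete, a training Job might look roughly like the following. The image name, command, paths, and the dl-data claim from Step 0 are placeholders rather than part of the official sample; the sample YAML linked below is the authoritative reference.

```yaml
# Sketch of a training Job; names, paths, and resource sizes are placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: dl-train
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: train
          image: registry.example.com/user/dl-train:latest  # your image on a public-facing registry
          command: ["python", "train.py"]                    # assumed training entrypoint
          workingDir: /data/experiments                      # a path on the persistent volume
          ports:
            - containerPort: 29500                           # example DDP port, only needed for distributed training
          resources:
            requests:
              cpu: "4"
              memory: 16Gi
              nvidia.com/gpu: 2                              # requests and limits set to the same number
            limits:
              cpu: "4"
              memory: 16Gi
              nvidia.com/gpu: 2
          volumeMounts:
            - name: data
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm                            # shared memory for PyTorch DataLoader workers
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: dl-data                               # hypothetical claim from Step 0
        - name: dshm
          emptyDir:
            medium: Memory                                   # RAM-backed shared memory volume
```
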
There is a sample PyTorch training YAML in the repo here.

Step 3: Creating a GPU Pod for Inference

After your model has been trained, you will likely need to run evaluation or inference with your model. Because this process usually takes less than 30 minutes, it is recommended to use a pod with a GPU attached.

You will need to complete the same steps as for training the model, but ensure that your command is sleep infinity and that you have only requested the amount of resources allowed for Pods (12 GB RAM, 2 CPU, 1 GPU).
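
A minimal sketch of such a pod, again using a placeholder image and the hypothetical dl-data claim from Step 0, might look like this:

```yaml
# Sketch of an interactive GPU pod for evaluation/inference; names are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: dl-inference
spec:
  containers:
    - name: inference
      image: registry.example.com/user/dl-train:latest  # same image used for training
      command: ["sleep", "infinity"]                     # keep the pod alive for interactive use
      resources:
        requests:
          cpu: "2"
          memory: 12Gi
          nvidia.com/gpu: 1
        limits:
          cpu: "2"
          memory: 12Gi
          nvidia.com/gpu: 1
      volumeMounts:
        - name: data
          mountPath: /data                               # trained checkpoints live here
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: dl-data                               # hypothetical claim from Step 0
```

Once the pod is running, you can open a shell inside it with kubectl exec and run your evaluation or inference scripts against the checkpoints on the mounted volume.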

There is a sample Kube Spec for a GPU pod here.

Step 4: Copying Results Out of Nautilus

If you want to use your trained checkpoints outside of Nautilus and the data is larger than 1 GB, it is recommended to transfer it through a cloud storage bucket rather than with kubectl cp. There is information on doing that here.
