Using Persistent Storage

Alex Hurt edited this page Nov 17, 2022 · 4 revisions

The storage built into pods and jobs is deleted when the pod/job is deleted, so for data that you want to stage on Nautilus, or for artifacts from compute jobs, you will need to create and attach a persistent storage volume to your pods and jobs.

Creating Persistent Storage

To create persistent storage, you need to make a persistent volume claim, or PVC.

You can use the template present in the repo here. Be sure to change your_name to the name you want for the persistent storage volume.

For capacity, 500GB is a good starting point; keep in mind that it can be increased later if necessary.
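If you don't have the template handy, a minimal PVC spec looks roughly like the following sketch (the name, access mode, and capacity are placeholders to adjust for your use; your cluster may also require an explicit storageClassName):

```yaml
# Minimal PersistentVolumeClaim sketch; name and capacity are placeholders.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: your_name
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 500Gi
```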

Once you have configured your .yml file, you can create the storage by running:

kubectl create -f FILENAME.yml
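After creating the claim, you can confirm that it was provisioned and bound (these commands assume a configured kubectl context; your_name is the PVC name from your .yml):

```shell
# Check the status of the new claim; STATUS should read "Bound"
kubectl get pvc your_name

# Inspect capacity and events if the claim stays "Pending"
kubectl describe pvc your_name
```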

Using Persistent Storage

Once you have created your PVC, you will need to attach it to your pods and jobs and point the output paths of your compute processes at the mounted volumes.

To do this, you will need to include the volume in your pod/job spec and then specify a path to where you want your data mounted:

spec:
  containers:
    - name: pod-name-sso # YOUR CONTAINER NAME HERE
      volumeMounts:
        - mountPath: /data
          name: persistentVolume-name # YOUR PVC NAME HERE
  volumes:
    - name: persistentVolume-name # YOUR PVC NAME HERE
      persistentVolumeClaim:
        claimName: persistentVolume-name # YOUR PVC NAME HERE

An example kube spec YAML file can be found in the repo here.

Copying Data to Persistent Storage

Now that you have a persistent volume created and know how to use it in a pod or job, you will need to put the data you want to process on that volume.

Open Source ZIP and TAR

For publicly available .zip and .tar files, you can use an Ubuntu pod and install wget or curl to download the data onto your volume.
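As a sketch, assuming your PVC is mounted at /data and using placeholder pod and URL names, the download from inside the Ubuntu pod might look like:

```shell
# Open a shell in the running Ubuntu pod (pod name is a placeholder)
kubectl exec -it my-ubuntu-pod -- /bin/bash

# Inside the pod: install wget, then fetch the archive onto the mounted volume
apt-get update && apt-get install -y wget
wget -P /data https://example.com/dataset.zip
```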

Accessing Public Cloud Storage

If you are using data stored in public cloud storage buckets, you can use the corresponding command line tool from that cloud vendor to copy your data from the cloud storage bucket to your volume. There are sample Kube YAML files for both Google Cloud and AWS.
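For example, from a pod that has the vendor tooling installed and your volume mounted at /data (bucket and object names are placeholders):

```shell
# Google Cloud Storage: copy an object to the mounted volume
gsutil cp gs://MYBUCKET/MYDATA /data/

# AWS S3: the same idea with the AWS CLI
aws s3 cp s3://MYBUCKET/MYDATA /data/
```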

Using Custom Data

If you want to use a custom dataset not readily available in the public domain, the recommended method is to use either the Nautilus provided S3 storage or a personal Google Cloud Bucket.

The steps to use a personal Google Cloud Bucket are as follows:

  1. Create private Google Cloud Bucket (documentation here)
  2. Copy the data from your local machine to the bucket: gsutil cp MYDATA gs://MYBUCKET
  3. Create a pod on the Nautilus cluster with the gsutil command line tool and your PVC attached, as shown in this YAML spec
  4. Open a terminal to your GCP pod: kubectl exec -it MYPOD -- /bin/bash
  5. Log in to your Google Account from the pod: gsutil config
  6. Copy the data from the bucket to your persistent volume: gsutil cp gs://MYBUCKET/MYDATA /MYVOLUME/
  7. Delete the data from the bucket: gsutil rm gs://MYBUCKET/MYDATA
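The steps above can be sketched as a single command sequence (pod, bucket, file, and mount-path names are placeholders):

```shell
# 2. From your local machine: upload the data to your private bucket
gsutil cp MYDATA gs://MYBUCKET

# 4. Open a terminal to the pod that has gsutil and your PVC attached
kubectl exec -it MYPOD -- /bin/bash

# 5. Inside the pod: authenticate gsutil with your Google account
gsutil config

# 6. Copy the data from the bucket onto the mounted persistent volume
gsutil cp gs://MYBUCKET/MYDATA /MYVOLUME/

# 7. Remove the staged copy from the bucket
gsutil rm gs://MYBUCKET/MYDATA
```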

To use the provided Nautilus S3 storage, you can follow the steps outlined on the Using Nautilus S3 wiki page.