| 1 | +# RDM Content Provider |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +This content provider is used to create a Jupyter notebook server from a GakuNin RDM project. |
| 6 | + |
| 7 | +## Configuration |
| 8 | + |
| 9 | +### Folder Mapping Configuration File: `paths.yaml` |
| 10 | + |
| 11 | +When setting up an analysis environment from a GakuNin RDM project, the `paths.yaml` file can be used to specify which files should be copied into the image and which should be symbolically linked. This file explicitly lists file paths and directories and defines whether each path should be copied or symlinked. |
| 12 | + |
| 13 | +The `paths.yaml` file serves a similar purpose to the `fstab` file in Linux, mapping folders in the GakuNin RDM project to appropriate locations within the image. It should be placed in the `.binder` or `binder` directory, and the image builder (`repo2docker`) will prioritize loading from these locations to automatically apply the necessary copy and symlink settings. |
| 14 | + |
| 15 | +The syntax of `paths.yaml` is based on the `volumes` section in Docker Compose file specifications: |
| 16 | +https://docs.docker.com/reference/compose-file/volumes/ |
| 17 | + |
| 18 | +An example `paths.yaml` is shown below: |
| 19 | + |
| 20 | +```yaml |
| 21 | +override: true |
| 22 | +paths: |
| 23 | +  - type: copy |
| 24 | +    source: $default_storage_path/custom-home-dir |
| 25 | +    target: . |
| 26 | +  - type: link |
| 27 | +    source: /googledrive/subdir |
| 28 | +    target: ./external/googledrive |
| 29 | +``` |
| 30 | + |
| 31 | +In the example above, `$default_storage_path/custom-home-dir` is copied to the output directory (the home directory when the environment starts), and `/googledrive/subdir` is symlinked at `./external/googledrive` relative to that directory. |
| 32 | + |
| 33 | +The `paths.yaml` file is written in YAML format as a dictionary. The top-level dictionary contains the following keys: |
| 34 | + |
| 35 | +* `override`: Set to `true` to disable the default folder mapping (which copies the default storage content to the current directory). If omitted, it is treated as `false`. |
| 36 | +* `paths`: A list defining how each file or folder should be handled. Each item is a dictionary specifying the behavior for a specific path. |
| 37 | + |
| 38 | +When `paths.yaml` is present, repo2docker renders the mapping into a `provision.sh` script inside the same `.binder` (or `binder`) directory in the built image. The script contains the `cp` and `ln -s` commands derived from each entry and is executed during container start-up to materialize the requested copy/link layout in the runtime home directory. |
| 39 | + |
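The rendering step described above can be pictured with a small sketch. It is not the actual repo2docker implementation: the function name `render_commands` and its `storage_root` parameter are hypothetical, sources are assumed to be already resolved (with `$default_storage_path` substituted) and mounted under `/mnt/rdm`, and `copy` sources are assumed to be directories.

```python
import shlex


def render_commands(entries, storage_root="/mnt/rdm"):
    """Turn copy/link entries into shell command lines (illustrative only)."""
    lines = []
    for entry in entries:
        # Quote both paths so spaces or shell metacharacters stay literal.
        source = shlex.quote(storage_root + entry["source"])
        target = shlex.quote(entry["target"])
        if entry["type"] == "copy":
            # Create the target first, then copy the directory contents into it.
            lines.append(f"mkdir -p {target}")
            lines.append(f"cp -fr {source}/* {target}")
        elif entry["type"] == "link":
            lines.append(f"ln -s {source} {target}")
        else:
            raise ValueError(f"unknown type: {entry['type']!r}")
    return lines


if __name__ == "__main__":
    example = [
        {"type": "copy", "source": "/osfstorage/data", "target": "./data"},
        {"type": "link", "source": "/googledrive/subdir", "target": "./external/googledrive"},
    ]
    print("\n".join(render_commands(example)))
```

The glob in the `cp` line sits outside the quoted path on purpose, so the shell still expands it when the path itself needs quoting.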
| 40 | +#### Elements in the `paths` List |
| 41 | + |
| 42 | +Each item in the `paths` list is a dictionary containing the following keys: |
| 43 | + |
| 44 | +* `type`: Specifies the operation to apply to the folder. Must be either `copy` (copies files from the source) or `link` (creates a symbolic link). |
| 45 | +* `source`: The path to the file/folder within the GakuNin RDM project. For example, to specify a folder named `testdir` in a Google Drive storage provider, use `/googledrive/testdir`. The variable `$default_storage_path` can be used to refer to the project’s default storage (note: the default storage is not necessarily `osfstorage`, depending on the institution). |
| 46 | +* `target`: Specifies where the file/folder should be placed in the analysis environment. This must be a relative path from the output directory (the home directory when the environment starts). To explicitly indicate a relative path, only paths starting with `.` or `./` are allowed. |
| 47 | + |
| 48 | +> Absolute paths are not allowed for `target`, to prevent the injection of unauthorized executables into the `repo2docker` environment. |
| 49 | + |
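The rules above can be checked before a build with a short script. The following is a minimal sketch, not the validation that repo2docker itself performs: `validate_paths_config` is a hypothetical helper that takes the already-parsed dictionary (for example from a YAML parser such as PyYAML's `yaml.safe_load`), treats `paths` as required, and reads the `target` rule as "exactly `.` or a path starting with `./`".

```python
ALLOWED_TYPES = {"copy", "link"}


def validate_paths_config(config):
    """Return a list of problems found; an empty list means no rule was violated."""
    if not isinstance(config, dict):
        return ["top level must be a dictionary"]
    errors = []
    # `override` is optional and defaults to false.
    if not isinstance(config.get("override", False), bool):
        errors.append("override must be true or false")
    paths = config.get("paths")
    if not isinstance(paths, list):
        errors.append("paths must be a list")
        return errors
    for i, item in enumerate(paths):
        if not isinstance(item, dict):
            errors.append(f"paths[{i}] must be a dictionary")
            continue
        if item.get("type") not in ALLOWED_TYPES:
            errors.append(f"paths[{i}].type must be 'copy' or 'link'")
        if not isinstance(item.get("source"), str) or not item["source"]:
            errors.append(f"paths[{i}].source must be a non-empty string")
        target = item.get("target")
        # Absolute targets are rejected; only '.' or './...' is accepted here.
        if not isinstance(target, str) or not (target == "." or target.startswith("./")):
            errors.append(f"paths[{i}].target must be '.' or start with './'")
    return errors
```

Returning a list of messages instead of raising on the first problem lets a user fix every issue in one pass.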
| 50 | +If no `paths.yaml` is provided, the default behavior is as follows: |
| 51 | + |
| 52 | +```yaml |
| 53 | +paths: |
| 54 | +  - type: copy |
| 55 | +    source: $default_storage_path |
| 56 | +    target: . |
| 57 | +``` |
| 58 | + |
| 59 | +## Running provision.sh with JupyterHub |
| 60 | + |
| 61 | +When deploying repo2docker-built images with JupyterHub, you can automatically execute the `provision.sh` script at container startup to provision RDM data. |
| 62 | + |
| 63 | +### Background Execution to Avoid Timeout |
| 64 | + |
| 65 | +Because copying large datasets can take long enough to trigger a JupyterHub spawn timeout, the `provision.sh` script supports a background execution mode. When called with command-line arguments, it will: |
| 66 | + |
| 67 | +1. Start provisioning (copy/link operations) in the background |
| 68 | +2. Immediately execute the passed command (e.g., `jupyterhub-singleuser`) |
| 69 | + |
| 70 | +This allows the JupyterHub server to start while data provisioning continues in the background. |
| 71 | + |
| 72 | +### JupyterHub Configuration |
| 73 | + |
| 74 | +Configure your JupyterHub spawner to execute `provision.sh` if it exists: |
| 75 | + |
| 76 | +#### KubeSpawner Example |
| 77 | + |
| 78 | +```python |
| 79 | +# In jupyterhub_config.py |
| 80 | +c.KubeSpawner.cmd = [ |
| 81 | +    'bash', '-c', |
| 82 | +    ''' |
| 83 | +    set -e |
| 84 | + |
| 85 | +    # Find and execute provision.sh if it exists |
| 86 | +    for path in \ |
| 87 | +        "${REPO_DIR}/binder/provision.sh" \ |
| 88 | +        "${REPO_DIR}/.binder/provision.sh" \ |
| 89 | +        "$HOME/binder/provision.sh" \ |
| 90 | +        "$HOME/.binder/provision.sh"; do |
| 91 | + |
| 92 | +        if [ -f "$path" ]; then |
| 93 | +            echo "[provision-wrapper] Executing: $path" >&2 |
| 94 | +            exec bash "$path" "$@" |
| 95 | +        fi |
| 96 | +    done |
| 97 | + |
| 98 | +    # No provision.sh found, start normally |
| 99 | +    exec "$@" |
| 100 | +    ''', |
| 101 | +    '--', 'jupyterhub-singleuser' |
| 102 | +] |
| 103 | +``` |
| 104 | + |
| 105 | +#### DockerSpawner Example |
| 106 | + |
| 107 | +```python |
| 108 | +# In jupyterhub_config.py |
| 109 | +c.DockerSpawner.cmd = [ |
| 110 | +    'bash', '-c', |
| 111 | +    ''' |
| 112 | +    set -e |
| 113 | +    for path in \ |
| 114 | +        "${REPO_DIR}/binder/provision.sh" \ |
| 115 | +        "${REPO_DIR}/.binder/provision.sh" \ |
| 116 | +        "$HOME/binder/provision.sh" \ |
| 117 | +        "$HOME/.binder/provision.sh"; do |
| 118 | +        [ -f "$path" ] && exec bash "$path" "$@" |
| 119 | +    done |
| 120 | +    exec "$@" |
| 121 | +    ''', |
| 122 | +    '--', 'jupyterhub-singleuser' |
| 123 | +] |
| 124 | +``` |
| 125 | + |
| 126 | +### How provision.sh Works |
| 127 | + |
| 128 | +The generated `provision.sh` script accepts command-line arguments and has the following structure: |
| 129 | + |
| 130 | +```bash |
| 131 | +#!/bin/bash |
| 132 | +set -e |
| 133 | + |
| 134 | +# Run provisioning in background |
| 135 | +{ |
| 136 | +    # Copy and link operations |
| 137 | +    mkdir -p './target/path/' |
| 138 | +    cp -fr '/mnt/rdm/storage/data/'* './target/path/' |
| 139 | +    ln -s '/mnt/rdm/large-data/' './data' |
| 140 | +} & |
| 141 | + |
| 142 | +# Execute passed command if provided |
| 143 | +if [ $# -gt 0 ]; then |
| 144 | +    exec "$@" |
| 145 | +fi |
| 146 | +``` |
| 147 | + |
| 148 | +**Note**: If `/mnt/rdm/` does not exist but `/mnt/rdms/{project_id}/` is available, the build process or provisioning process will automatically create a symlink from `/mnt/rdm` to `/mnt/rdms/{project_id}`. |
| 149 | + |
| 150 | +### Monitoring Provisioning Progress |
| 151 | + |
| 152 | +Users can check the provisioning progress from within the Jupyter environment: |
| 153 | + |
| 154 | +```bash |
| 155 | +# View the provisioning log in real-time |
| 156 | +tail -f /tmp/provision.log |
| 157 | + |
| 158 | +# Check if provisioning is complete |
| 159 | +grep "completed" /tmp/provision.log |
| 160 | +``` |
| 161 | + |
| 162 | +The log file `/tmp/provision.log` contains: |
| 163 | +- Start and completion timestamps |
| 164 | +- Each copy/link operation with source and target paths |
| 165 | +- Detailed command output (from `set -x`) |
| 166 | +- Any errors that occur during provisioning |
| 167 | + |
| 168 | +### Notes |
| 169 | + |
| 170 | +- The `REPO_DIR` environment variable points to the repository directory (default: `/home/jovyan`) |
| 171 | +- Provisioning runs in the background, so large data copies won't block JupyterHub startup |
| 172 | +- Symbolic links are created immediately and are available right away |
| 173 | +- Check `/tmp/provision.log` for provisioning progress and errors |
| 174 | +- Container logs will show when provisioning starts and how to monitor it |