Commit 322c32f

Merge pull request #22 from yacchin1205/feature/grdm-volumes
[GRDM-55309] Add paths.yaml support and JupyterHub integration
2 parents 65f5856 + f77c16d commit 322c32f

13 files changed

Lines changed: 1454 additions & 175 deletions

dev-requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -1,6 +1,7 @@
 build
 conda-lock
 pre-commit
+pytest-asyncio
 pytest-cov
 pytest>=7
 pyyaml

repo2docker/buildpacks/base.py

Lines changed: 5 additions & 1 deletion
@@ -750,7 +750,11 @@ def get_preassemble_scripts(self):
     @lru_cache()
     def get_assemble_scripts(self):
         """Return directives to run after the entire repository has been added to the image"""
-        return []
+        scripts = []
+        prepare_mnt = self.binder_path("prepare_mnt.sh")
+        if os.path.exists(prepare_mnt):
+            scripts.append(("root", f"bash {prepare_mnt}"))
+        return scripts
 
     @lru_cache()
     def get_post_build_scripts(self):

repo2docker/contentproviders/rdm.py

Lines changed: 0 additions & 156 deletions
This file was deleted.
Lines changed: 174 additions & 0 deletions
@@ -0,0 +1,174 @@
# RDM Content Provider

## Overview

This content provider is used to create a Jupyter notebook server from a GakuNin RDM project.

## Configuration

### Folder Mapping Configuration File: `paths.yaml`

When setting up an analysis environment from a GakuNin RDM project, the `paths.yaml` file can be used to specify which files should be copied into the image and which should be symbolically linked. This file explicitly lists file paths and directories and defines whether each path should be copied or symlinked.

The `paths.yaml` file serves a similar purpose to the `fstab` file in Linux, mapping folders in the GakuNin RDM project to appropriate locations within the image. It should be placed in the `.binder` or `binder` directory, and the image builder (`repo2docker`) will prioritize loading from these locations to automatically apply the necessary copy and symlink settings.

The syntax of `paths.yaml` is based on the `volumes` section in Docker Compose file specifications:
https://docs.docker.com/reference/compose-file/volumes/

An example `paths.yaml` is shown below:

```yaml
override: true
paths:
  - type: copy
    source: $default_storage_path/custom-home-dir
    target: .
  - type: link
    source: /googledrive/subdir
    target: ./external/googledrive
```

In the example above, the contents of `$default_storage_path/custom-home-dir` are copied into the output directory (the home directory when the environment starts), and `/googledrive/subdir` is symlinked to `./external/googledrive` relative to that directory.

The `paths.yaml` file is written in YAML format as a dictionary. The top-level dictionary supports the following keys:

* `override`: Set to `true` to disable the default folder mapping (which copies the default storage content to the current directory). If omitted, it is treated as `false`.
* `paths`: A list defining how each file or folder should be handled. Each item is a dictionary specifying the behavior for a specific path.

When `paths.yaml` is present, repo2docker renders the mapping into a `provision.sh` script inside the same `.binder` (or `binder`) directory in the built image. The script contains the `cp` and `ln -s` commands derived from each entry and is executed during container start-up to materialize the requested copy/link layout in the runtime home directory.

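
The rendering step described above can be illustrated with a minimal sketch. This is not the repo2docker implementation: `render_commands` is a hypothetical helper, the entries are assumed to have had their `source` already resolved to a path under `/mnt/rdm`, and the command shapes (`mkdir -p`, `cp -fr`, `ln -s`) simply mirror the generated script shown later in this document.

```python
import shlex


def render_commands(entries):
    """Turn resolved path entries into shell command strings (hypothetical helper)."""
    commands = []
    for entry in entries:
        # Quote both paths so spaces and shell metacharacters stay literal.
        source = shlex.quote(entry["source"])
        target = shlex.quote(entry["target"])
        if entry["type"] == "copy":
            # Copy the contents of the source folder into the target folder.
            # The glob stays outside the quotes so the shell expands it.
            commands.append(f"mkdir -p {target}")
            commands.append(f"cp -fr {source}/* {target}")
        elif entry["type"] == "link":
            # Parent directories of the link target are not created in this sketch.
            commands.append(f"ln -s {source} {target}")
        else:
            raise ValueError(f"unknown type: {entry['type']!r}")
    return commands


# Hypothetical, already-resolved entries; the /mnt/rdm layout is an assumption.
example = [
    {"type": "copy", "source": "/mnt/rdm/storage/custom home", "target": "."},
    {"type": "link", "source": "/mnt/rdm/googledrive/subdir",
     "target": "./external/googledrive"},
]
for line in render_commands(example):
    print(line)
```

Quoting each path with `shlex.quote` is a design choice of the sketch: project folder names can contain spaces, and unquoted paths would split into several shell arguments.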

#### Elements in the `paths` List

Each item in the `paths` list is a dictionary containing the following keys:

* `type`: Specifies the operation to apply to the folder. Must be either `copy` (copies files from the source) or `link` (creates a symbolic link).
* `source`: The path to the file/folder within the GakuNin RDM project. For example, to specify a folder named `testdir` in a Google Drive storage provider, use `/googledrive/testdir`. The variable `$default_storage_path` can be used to refer to the project’s default storage (note: the default storage is not necessarily `osfstorage`, depending on the institution).
* `target`: Specifies where the file/folder should be placed in the analysis environment. This must be a relative path from the output directory (the home directory when the environment starts). To explicitly indicate a relative path, only paths starting with `.` or `./` are allowed.

> Absolute paths are not allowed for `target`, to prevent the injection of unauthorized executables into the `repo2docker` environment.

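
The rules for each entry can be checked mechanically. The following is a minimal sketch, not the validation code used by repo2docker: `validate_entry` is a hypothetical helper, and it reads the `target` rule strictly as "exactly `.` or a path beginning with `./`", which is an assumption about how the rule is meant.

```python
ALLOWED_TYPES = {"copy", "link"}


def validate_entry(entry):
    """Return a list of problems found in one `paths` entry (empty if valid)."""
    problems = []
    if entry.get("type") not in ALLOWED_TYPES:
        problems.append("type must be 'copy' or 'link'")
    if not entry.get("source"):
        problems.append("source is required")
    target = str(entry.get("target") or "")
    # Assumption: only "." itself or paths beginning with "./" count as relative.
    if not (target == "." or target.startswith("./")):
        problems.append("target must be '.' or start with './'")
    return problems


good = {"type": "link", "source": "/googledrive/subdir",
        "target": "./external/googledrive"}
bad = {"type": "copy", "source": "$default_storage_path", "target": "/usr/local/bin"}
print(validate_entry(good))
print(validate_entry(bad))
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a `paths.yaml` file at once.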

If no `paths.yaml` is provided, the default behavior is as follows:

```yaml
paths:
  - type: copy
    source: $default_storage_path
    target: .
```

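
How `override` interacts with the default mapping can be sketched as follows. `effective_paths` and `DEFAULT_ENTRY` are hypothetical names, and placing the default entry before the user-defined entries is an assumption; the source only states that `override: true` disables the default mapping.

```python
DEFAULT_ENTRY = {"type": "copy", "source": "$default_storage_path", "target": "."}


def effective_paths(config):
    """Combine the default mapping with user entries, honoring `override`."""
    config = config or {}
    entries = list(config.get("paths") or [])
    if config.get("override", False):
        # The default mapping is disabled; use only the listed entries.
        return entries
    # Assumption: the default copy is applied first, then the listed entries.
    return [dict(DEFAULT_ENTRY)] + entries


link = {"type": "link", "source": "/googledrive/subdir",
        "target": "./external/googledrive"}
print(effective_paths(None))
print(effective_paths({"paths": [link]}))
print(effective_paths({"override": True, "paths": [link]}))
```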

## Running provision.sh with JupyterHub

When deploying repo2docker-built images with JupyterHub, you can automatically execute the `provision.sh` script at container startup to provision RDM data.

### Background Execution to Avoid Timeout

Copying large datasets can take long enough to cause a JupyterHub spawn timeout, so the `provision.sh` script supports a background execution mode. When called with command-line arguments, it will:

1. Start provisioning (copy/link operations) in the background
2. Immediately execute the passed command (e.g., `jupyterhub-singleuser`)

This allows the JupyterHub server to start while data provisioning continues in the background.

### JupyterHub Configuration

Configure your JupyterHub spawner to execute `provision.sh` if it exists:

#### KubeSpawner Example

```python
# In jupyterhub_config.py
c.KubeSpawner.cmd = [
    'bash', '-c',
    '''
    set -e

    # Find and execute provision.sh if it exists
    for path in \
        "${REPO_DIR}/binder/provision.sh" \
        "${REPO_DIR}/.binder/provision.sh" \
        "$HOME/binder/provision.sh" \
        "$HOME/.binder/provision.sh"; do

        if [ -f "$path" ]; then
            echo "[provision-wrapper] Executing: $path" >&2
            exec bash "$path" "$@"
        fi
    done

    # No provision.sh found, start normally
    exec "$@"
    ''',
    '--', 'jupyterhub-singleuser'
]
```

#### DockerSpawner Example

```python
# In jupyterhub_config.py
c.DockerSpawner.cmd = [
    'bash', '-c',
    '''
    set -e
    for path in \
        "${REPO_DIR}/binder/provision.sh" \
        "${REPO_DIR}/.binder/provision.sh" \
        "$HOME/binder/provision.sh" \
        "$HOME/.binder/provision.sh"; do
        [ -f "$path" ] && exec bash "$path" "$@"
    done
    exec "$@"
    ''',
    '--', 'jupyterhub-singleuser'
]
```

### How provision.sh Works

The generated `provision.sh` script accepts command-line arguments and has the following structure:

```bash
#!/bin/bash
set -e

# Run provisioning in background
{
    # Copy and link operations
    mkdir -p './target/path/'
    cp -fr '/mnt/rdm/storage/data/'* './target/path/'
    ln -s '/mnt/rdm/large-data/' './data'
} &

# Execute passed command if provided
if [ $# -gt 0 ]; then
    exec "$@"
fi
```

**Note**: If `/mnt/rdm/` does not exist but `/mnt/rdms/{project_id}/` is available, the build process or provisioning process will automatically create a symlink from `/mnt/rdm` to `/mnt/rdms/{project_id}`.

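
The fallback described in the note can be sketched in Python. This only illustrates the rule and is not the code run during build or provisioning: `ensure_rdm_mount` and its `mnt_root` parameter are hypothetical, introduced so the logic can be tried against a temporary directory instead of the real `/mnt`.

```python
import os


def ensure_rdm_mount(project_id, mnt_root="/mnt"):
    """Link <mnt_root>/rdm to <mnt_root>/rdms/<project_id> when only the latter exists.

    Returns True if a symlink was created, False otherwise.
    """
    rdm = os.path.join(mnt_root, "rdm")
    per_project = os.path.join(mnt_root, "rdms", project_id)
    # Leave an existing /mnt/rdm (directory, file, or symlink) untouched,
    # and do nothing when the per-project directory is missing.
    if os.path.lexists(rdm) or not os.path.isdir(per_project):
        return False
    os.symlink(per_project, rdm)
    return True
```

Using `os.path.lexists` instead of `os.path.exists` means a dangling `/mnt/rdm` symlink is also left alone rather than causing `os.symlink` to fail.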

### Monitoring Provisioning Progress

Users can check the provisioning progress from within the Jupyter environment:

```bash
# View the provisioning log in real-time
tail -f /tmp/provision.log

# Check if provisioning is complete
grep "completed" /tmp/provision.log
```

The log file `/tmp/provision.log` contains:
- Start and completion timestamps
- Each copy/link operation with source and target paths
- Detailed command output (from `set -x`)
- Any errors that occur during provisioning

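
From a notebook cell, the same completion check can be done in Python. This is a minimal sketch that mirrors the `grep "completed"` command above: `provisioning_complete` is a hypothetical helper, and the exact wording of the completion line in the log is not specified here beyond containing the word `completed`.

```python
from pathlib import Path


def provisioning_complete(log_path="/tmp/provision.log"):
    """Return True once the log contains a line mentioning 'completed'."""
    path = Path(log_path)
    if not path.exists():
        # Provisioning has not started writing its log yet.
        return False
    with path.open(errors="replace") as log:
        return any("completed" in line for line in log)


print(provisioning_complete())
```

A notebook that depends on copied data could call this in a short polling loop before reading its input files.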
167+
168+
### Notes
169+
170+
- The `REPO_DIR` environment variable points to the repository directory (default: `/home/jovyan`)
171+
- Provisioning runs in the background, so large data copies won't block JupyterHub startup
172+
- Symbolic links are created immediately and are available right away
173+
- Check `/tmp/provision.log` for provisioning progress and errors
174+
- Container logs will show when provisioning starts and how to monitor it
