downloadFiles returns only root-level files (cache never populated) #3846
Summary
On the GraphQL endpoint, the downloadFiles field on Snapshot is documented as returning a "single list of files to download this snapshot," but in practice it only returns root-level files — never files inside subdirectories. This forces clients like openneuro-py to recursively call files(tree: ...) one directory at a time, resulting in an N+1 query pattern for every dataset download.
Expected behavior
```graphql
{
  snapshot(datasetId: "ds000117", tag: "1.0.4") {
    downloadFiles {
      filename
      urls
      size
    }
  }
}
```

Should return a flat list of all files in the snapshot, recursively, as described in #2797.
Actual behavior
The query returns only root-level files. For example:
| Dataset | Root-level files returned | Subdirectories (not traversed) |
|---|---|---|
| ds000001 | 5 files (all root-level) | 16 subdirectories |
| ds000117 | 26 files (all root-level) | 19 subdirectories |
| ds003104 | 5 files (all root-level) | 3 subdirectories |
No returned filename contains a path separator — subdirectory contents are never included.
Root cause
downloadFiles calls getFilesRecursive in datalad/snapshots.ts, which recursively calls getFiles in datalad/files.ts. However, getFiles has a doNotCache guard:
`openneuro/packages/openneuro-server/src/datalad/files.ts`, lines 109 to 113 in `fb46610`:

```typescript
// Skip caching this tree if it doesn't contain S3 URLs - likely still exporting
if (!f.directory && !f.urls[0].includes("s3.amazonaws.com")) {
  doNotCache(true)
  break
}
```
When files don't yet have S3 URLs (which appears to be common), doNotCache(true) is called, and the recursive results are never persisted to the Redis cache. On subsequent requests, the cache miss means only the root-level files are returned.
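The effect of the guard can be illustrated with a minimal sketch (the `shouldCacheTree` helper and sample data below are hypothetical; only the S3-URL check mirrors the excerpt above):

```typescript
interface SnapshotFile {
  filename: string
  directory: boolean
  urls: string[]
}

// Mirrors the guard in datalad/files.ts: a tree is considered cacheable
// only if every regular file already has an S3 URL.
function shouldCacheTree(files: SnapshotFile[]): boolean {
  for (const f of files) {
    if (!f.directory && !f.urls[0].includes("s3.amazonaws.com")) {
      return false // corresponds to doNotCache(true) in the real code
    }
  }
  return true
}

// Fully exported snapshot: S3 URLs present, tree may be cached.
const exported: SnapshotFile[] = [
  { filename: "README", directory: false, urls: ["https://s3.amazonaws.com/openneuro.org/README"] },
]
// Snapshot still exporting: only proxy URLs, recursive results are discarded.
const stillExporting: SnapshotFile[] = [
  { filename: "README", directory: false, urls: ["https://openneuro.org/crn/README"] },
]

console.log(shouldCacheTree(exported)) // true
console.log(shouldCacheTree(stillExporting)) // false
```

Because the discarded result is never persisted, every later request starts from an empty cache and surfaces only the root level.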
This was introduced in #2797 (April 2023), which describes the feature as "a new flat API call that is indefinitely cached once a snapshot is fully exported." The caching precondition (all files having S3 URLs) appears to rarely be met in practice.
Impact on clients
Without a working downloadFiles, clients must manually traverse the directory tree:
- Query
files(tree: null)to get root entries - For each directory, query
files(tree: "dirname") - Recurse into subdirectories
Each directory in the dataset tree requires its own GraphQL query. For real-world BIDS datasets, this adds up fast:
| Dataset | Subjects | Total files | Est. directories | GraphQL queries needed | With working downloadFiles |
|---|---|---|---|---|---|
| ds000001 | 16 | 133 | ~49 | ~50 | 1 |
| ds000117 | 17 | 1,156 | ~355 | ~355 | 1 |
| ds000228 | 155 | 626 | ~622 | ~622 | 1 |
| ds002345 | 345 | 4,164 | ~1,037 | ~1,037 | 1 |
At ~200 ms per round-trip, even with 10-way parallel fetching (as implemented in openneuro-py PR #289), metadata enumeration for ds002345 takes on the order of 20 seconds before any file download can begin. Without parallelization, it would be ~3.5 minutes of pure metadata queries.
Parallelization also can't fully solve this: children aren't discoverable until their parent directory has been queried. For deep trees like FreeSurfer derivatives (depth 6: derivatives/freesurfer/sub-XX/ses-mri/anat/label/), that's a minimum of 6 sequential round-trips before leaf files are even known. The tree must be traversed level by level.
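The query count and the sequential-depth floor can be sketched with a small simulation (mock tree and function names are hypothetical; each map lookup stands in for one `files(tree: ...)` round-trip):

```typescript
// Mock directory tree: key is a directory path, value lists its entries.
type Tree = Map<string, { name: string; isDir: boolean }[]>

const tree: Tree = new Map([
  ["", [{ name: "derivatives", isDir: true }, { name: "README", isDir: false }]],
  ["derivatives", [{ name: "freesurfer", isDir: true }]],
  ["derivatives/freesurfer", [{ name: "sub-01", isDir: true }]],
  ["derivatives/freesurfer/sub-01", [{ name: "lh.aparc.annot", isDir: false }]],
])

// Breadth-first traversal: each directory costs one query, and a new level
// cannot start until the previous level's results have arrived.
function enumerate(t: Tree): { queries: number; sequentialLevels: number; files: string[] } {
  let queries = 0
  let sequentialLevels = 0
  const files: string[] = []
  let frontier = [""]
  while (frontier.length > 0) {
    sequentialLevels++ // queries within a level can run in parallel
    const next: string[] = []
    for (const dir of frontier) {
      queries++ // one GraphQL round-trip per directory
      for (const entry of t.get(dir) ?? []) {
        const path = dir === "" ? entry.name : `${dir}/${entry.name}`
        if (entry.isDir) next.push(path)
        else files.push(path)
      }
    }
    frontier = next
  }
  return { queries, sequentialLevels, files }
}

console.log(enumerate(tree))
// 4 directories => 4 queries and 4 sequential levels,
// even with unlimited parallelism within each level
```

A working `downloadFiles` collapses this entire traversal into a single request.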
Server-side impact
This isn't just a client-side inconvenience. Every openneuro-py download of ds002345 sends ~1,037 GraphQL requests to the OpenNeuro API. Multiply that by the number of users and CI jobs downloading datasets, and the N+1 pattern generates significant unnecessary load on the API server. A working downloadFiles would reduce this to 1 request per download — an improvement by orders of magnitude for both clients and the server.
Suggestions
Option A: Fix downloadFiles
The per-directory files(tree: ...) endpoint already returns working proxy URLs (https://openneuro.org/crn/...) when S3 URLs aren't available. downloadFiles should do the same — return the recursive listing with whatever URLs are available, rather than silently returning only root-level files when the S3 export hasn't completed. The recursive traversal code in getFilesRecursive already exists and works; the issue is purely in the doNotCache guard rejecting results that lack S3 URLs.
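One possible shape of the fix, sketched below (the `collectDownloadFiles` helper, `TreeEntry` type, and mock data are hypothetical, not the actual server code): always return the full recursive listing, and use the S3 check only to decide whether the result may be cached indefinitely.

```typescript
interface TreeEntry {
  filename: string
  directory: boolean
  urls: string[]
}

// Always walk the whole tree; never drop results. The S3 check only
// controls cacheability, so proxy URLs are returned while export runs.
async function collectDownloadFiles(
  getTree: (dir: string | null) => Promise<TreeEntry[]>,
): Promise<{ files: TreeEntry[]; cacheable: boolean }> {
  const files: TreeEntry[] = []
  let cacheable = true
  async function walk(dir: string | null, prefix: string): Promise<void> {
    for (const entry of await getTree(dir)) {
      const path = prefix + entry.filename
      if (entry.directory) {
        await walk(path, path + "/")
      } else {
        if (!entry.urls[0].includes("s3.amazonaws.com")) cacheable = false
        files.push({ ...entry, filename: path })
      }
    }
  }
  await walk(null, "")
  return { files, cacheable }
}

// Mock snapshot that is still exporting (proxy URLs only).
const mockTrees: Record<string, TreeEntry[]> = {
  "": [
    { filename: "sub-01", directory: true, urls: [] },
    { filename: "README", directory: false, urls: ["https://openneuro.org/crn/README"] },
  ],
  "sub-01": [
    { filename: "anat.nii.gz", directory: false, urls: ["https://openneuro.org/crn/anat"] },
  ],
}

collectDownloadFiles((dir) => Promise.resolve(mockTrees[dir ?? ""])).then((r) =>
  // full recursive listing returned; cacheable is false until S3 export finishes
  console.log(r.files.map((f) => f.filename), r.cacheable),
)
```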
Option B: Add a recursive argument to files
```graphql
files(tree: String, recursive: Boolean): [DatasetFile]
```

This would let clients opt into a flat recursive listing without relying on the separate downloadFiles field.
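With such an argument, a client could fetch the whole snapshot in one query (sketch only; the `recursive` argument does not exist in the current schema):

```graphql
{
  snapshot(datasetId: "ds000117", tag: "1.0.4") {
    files(tree: null, recursive: true) {
      filename
      urls
      size
    }
  }
}
```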
Why not use the git endpoint?
The official JS CLI uses isomorphic-git to clone datasets, sidestepping the GraphQL file-listing API entirely. However, for third-party tools like openneuro-py, the GraphQL API is the better fit:
- **Single-step download experience.** openneuro-py downloads data files directly via the URLs provided by the GraphQL API. By contrast, the JS CLI is a two-step process by design: it clones the git tree, then tells users to separately run `datalad get` or `git-annex get` to fetch the actual data files.
- **No git infrastructure dependency.** Users of openneuro-py don't need git, git-annex, or DataLad installed, just `uv add openneuro-py` / `uvx openneuro-py`. This matters especially in environments like HPC clusters or CI pipelines where installing git-annex may not be straightforward.
- **The GraphQL API is the documented programmatic interface.** The `files` and `downloadFiles` fields exist specifically for this use case. Since the JS CLI sidesteps these code paths entirely, regressions like this broken cache go undetected; there's no internal consumer exercising the feature.
Do you have any interest in helping implement the feature?
Yes, but I would need guidance on the caching architecture in datalad/files.ts.
Additional information / screenshots
No response.
cc @larsoner