downloadFiles returns only root-level files (cache never populated) #3846

@hoechenberger

Description

Summary

On the GraphQL endpoint, the downloadFiles field on Snapshot is documented as returning a "single list of files to download this snapshot," but in practice it only returns root-level files — never files inside subdirectories. This forces clients like openneuro-py to recursively call files(tree: ...) one directory at a time, resulting in an N+1 query pattern for every dataset download.

Expected behavior

```graphql
{
  snapshot(datasetId: "ds000117", tag: "1.0.4") {
    downloadFiles {
      filename
      urls
      size
    }
  }
}
```

Should return a flat list of all files in the snapshot, recursively — as described in #2797.

Actual behavior

The query returns only root-level files. For example:

| Dataset | Root-level files returned | Subdirectories (not traversed) |
|---|---|---|
| ds000001 | 5 files (all root-level) | 16 |
| ds000117 | 26 files (all root-level) | 19 |
| ds003104 | 5 files (all root-level) | 3 |

No returned filename contains a path separator — subdirectory contents are never included.

Root cause

downloadFiles calls getFilesRecursive in datalad/snapshots.ts, which recursively calls getFiles in datalad/files.ts. However, getFiles has a doNotCache guard:

```ts
// Skip caching this tree if it doesn't contain S3 URLs - likely still exporting
if (!f.directory && !f.urls[0].includes("s3.amazonaws.com")) {
  doNotCache(true)
  break
}
```

When files don't yet have S3 URLs (which appears to be common), doNotCache(true) is called, and the recursive results are never persisted to the Redis cache. On subsequent requests, the cache miss means only the root-level files are returned.

This was introduced in #2797 (April 2023), which describes the feature as "a new flat API call that is indefinitely cached once a snapshot is fully exported." The caching precondition (all files having S3 URLs) appears to rarely be met in practice.

Impact on clients

Without a working downloadFiles, clients must manually traverse the directory tree:

  1. Query files(tree: null) to get root entries
  2. For each directory, query files(tree: "dirname")
  3. Recurse into subdirectories
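The per-directory traversal above can be sketched as follows. This is a hypothetical model of the query pattern, not openneuro-py's actual code: `Entry` and `countQueries` are illustrative names, and the toy tree stands in for a real dataset listing.

```typescript
// An entry is a directory iff it has a children array (illustrative type).
type Entry = { name: string; children?: Entry[] }

// Count the GraphQL queries a per-directory traversal issues:
// one for the root listing, plus one per subdirectory encountered.
function countQueries(root: Entry[]): number {
  let queries = 1 // files(tree: null) for the root
  const stack = [...root]
  while (stack.length > 0) {
    const entry = stack.pop()!
    if (entry.children) {
      queries += 1 // files(tree: "<dirname>") for this directory
      stack.push(...entry.children)
    }
  }
  return queries
}

// Toy dataset: 2 subjects, each with one "anat" subdirectory.
const tree: Entry[] = [
  { name: "dataset_description.json" },
  { name: "sub-01", children: [{ name: "anat", children: [{ name: "sub-01_T1w.nii.gz" }] }] },
  { name: "sub-02", children: [{ name: "anat", children: [{ name: "sub-02_T1w.nii.gz" }] }] },
]

console.log(countQueries(tree)) // 1 root listing + 4 directories = 5 queries
```

Even this tiny tree needs 5 queries; the query count grows linearly with the number of directories, which is what the tables below quantify.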

Each directory in the dataset tree requires its own GraphQL query. For real-world BIDS datasets, this adds up fast:

| Dataset | Subjects | Total files | Est. directories | GraphQL queries needed | With working downloadFiles |
|---|---|---|---|---|---|
| ds000001 | 16 | 133 | ~49 | ~50 | 1 |
| ds000117 | 17 | 1,156 | ~355 | ~355 | 1 |
| ds000228 | 155 | 626 | ~622 | ~622 | 1 |
| ds002345 | 345 | 4,164 | ~1,037 | ~1,037 | 1 |

At ~200 ms per round-trip, even with 10-way parallel fetching (as implemented in openneuro-py PR #289), metadata enumeration for ds002345 takes on the order of 20 seconds before any file download can begin. Without parallelization, it would be ~3.5 minutes of pure metadata queries.

Parallelization also can't fully solve this: children aren't discoverable until their parent directory has been queried. For deep trees like FreeSurfer derivatives (depth 6: derivatives/freesurfer/sub-XX/ses-mri/anat/label/), that's a minimum of 6 sequential round-trips before leaf files are even known. The tree must be traversed level by level.
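The sequential lower bound can be stated as a one-liner: each directory component of a path costs one round-trip before its children become discoverable, so the minimum number of sequential round-trips equals the directory depth of the deepest file. A minimal sketch (the helper name is hypothetical; the path is the one from the issue text):

```typescript
// Minimum sequential round-trips to discover files under a given
// directory path, assuming one files(tree: ...) query per directory level.
function minSequentialRoundTrips(dirPath: string): number {
  return dirPath.split("/").filter(Boolean).length
}

const leaf = "derivatives/freesurfer/sub-XX/ses-mri/anat/label"
console.log(minSequentialRoundTrips(leaf)) // 6
```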

Server-side impact

This isn't just a client-side inconvenience. Every openneuro-py download of ds002345 sends ~1,037 GraphQL requests to the OpenNeuro API. Multiply that by the number of users and CI jobs downloading datasets, and the N+1 pattern generates significant unnecessary load on the API server. A working downloadFiles would reduce this to 1 request per download — an improvement by orders of magnitude for both clients and the server.

Suggestions

Option A: Fix downloadFiles

The per-directory files(tree: ...) endpoint already returns working proxy URLs (https://openneuro.org/crn/...) when S3 URLs aren't available. downloadFiles should do the same — return the recursive listing with whatever URLs are available, rather than silently returning only root-level files when the S3 export hasn't completed. The recursive traversal code in getFilesRecursive already exists and works; the issue is purely in the doNotCache guard rejecting results that lack S3 URLs.
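One possible shape for this fix, sketched below. This is hypothetical and only mirrors the guard snippet quoted above, not the full `datalad/files.ts` source; the function names and the 300-second TTL are assumptions. The idea: cache indefinitely only once every file has an S3 URL, but still return and briefly cache the recursive listing (with proxy URLs) in the meantime, instead of discarding it.

```typescript
interface SnapshotFile {
  directory: boolean
  urls: string[]
}

// Cache forever only once every non-directory entry has an S3 URL,
// i.e. the snapshot export has completed.
function shouldCacheIndefinitely(files: SnapshotFile[]): boolean {
  return files.every(
    (f) => f.directory || f.urls.some((u) => u.includes("s3.amazonaws.com")),
  )
}

// Otherwise still cache the recursive listing, but with a short TTL
// (hypothetical value) so proxy URLs are served until the export finishes.
function cacheTtlSeconds(files: SnapshotFile[]): number {
  return shouldCacheIndefinitely(files) ? 0 /* no expiry */ : 300
}
```

With this approach, `downloadFiles` degrades the same way `files(tree: ...)` already does: clients get proxy URLs immediately and S3 URLs once the export lands.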

Option B: Add a recursive argument to files

```graphql
files(tree: String, recursive: Boolean): [DatasetFile]
```

This would let clients opt into a flat recursive listing without relying on the separate downloadFiles field.

Why not use the git endpoint?

The official JS CLI uses isomorphic-git to clone datasets, sidestepping the GraphQL file-listing API entirely. However, for third-party tools like openneuro-py, the GraphQL API is the better fit:

  1. Single-step download experience. openneuro-py downloads data files directly via the URLs provided by the GraphQL API. By contrast, the JS CLI is a two-step process by design: it clones the git tree, then tells users to separately run datalad get or git-annex get to fetch the actual data files.

  2. No git infrastructure dependency. Users of openneuro-py don't need git, git-annex, or DataLad installed — just uv add openneuro-py / uvx openneuro-py. This matters especially in environments like HPC clusters or CI pipelines where installing git-annex may not be straightforward.

  3. The GraphQL API is the documented programmatic interface. The files and downloadFiles fields exist specifically for this use case. Since the JS CLI sidesteps these code paths entirely, regressions like this broken cache go undetected — there's no internal consumer exercising the feature.

Do you have any interest in helping implement the feature?

Yes, but I would need guidance on the caching architecture in datalad/files.ts.

Additional information / screenshots

No response.

cc @larsoner
