downloadFiles returns only root-level files (cache never populated) #3846

@hoechenberger

Description

Summary

On the GraphQL endpoint, the downloadFiles field on Snapshot is documented as returning a "single list of files to download this snapshot," but in practice it only returns root-level files — never files inside subdirectories. This forces clients like openneuro-py to recursively call files(tree: ...) one directory at a time, resulting in an N+1 query pattern for every dataset download.

Expected behavior

```graphql
{
  snapshot(datasetId: "ds000117", tag: "1.0.4") {
    downloadFiles {
      filename
      urls
      size
    }
  }
}
```

Should return a flat list of all files in the snapshot, recursively — as described in #2797.

Actual behavior

The query returns only root-level files. For example:

| Dataset | Root-level files returned | Subdirectories (not traversed) |
|---|---|---|
| ds000001 | 5 files (all root-level) | 16 |
| ds000117 | 26 files (all root-level) | 19 |
| ds003104 | 5 files (all root-level) | 3 |

No returned filename contains a path separator — subdirectory contents are never included.

Root cause

downloadFiles calls getFilesRecursive in datalad/snapshots.ts, which recursively calls getFiles in datalad/files.ts. However, getFiles has a doNotCache guard:

```ts
// Skip caching this tree if it doesn't contain S3 URLs - likely still exporting
if (!f.directory && !f.urls[0].includes("s3.amazonaws.com")) {
  doNotCache(true)
  break
}
```

When files don't yet have S3 URLs (which appears to be common), doNotCache(true) is called, and the recursive results are never persisted to the Redis cache. On subsequent requests, the cache miss means only the root-level files are returned.

This was introduced in #2797 (April 2023), which describes the feature as "a new flat API call that is indefinitely cached once a snapshot is fully exported." The caching precondition (all files having S3 URLs) appears to rarely be met in practice.

Impact on clients

Without a working downloadFiles, clients must manually traverse the directory tree:

  1. Query files(tree: null) to get root entries
  2. For each directory, query files(tree: "dirname")
  3. Recurse into subdirectories
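The per-directory traversal above can be sketched as follows. This is a hypothetical model of the query pattern, not openneuro-py's actual code: `Entry` and `countQueries` are illustrative names, and the toy tree stands in for a real dataset listing.

```typescript
// An entry is a directory iff it has a children array (illustrative type).
type Entry = { name: string; children?: Entry[] }

// Count the GraphQL queries a per-directory traversal issues:
// one for the root listing, plus one per subdirectory encountered.
function countQueries(root: Entry[]): number {
  let queries = 1 // files(tree: null) for the root
  const stack = [...root]
  while (stack.length > 0) {
    const entry = stack.pop()!
    if (entry.children) {
      queries += 1 // files(tree: "<dirname>") for this directory
      stack.push(...entry.children)
    }
  }
  return queries
}

// Toy dataset: 2 subjects, each with one "anat" subdirectory.
const tree: Entry[] = [
  { name: "dataset_description.json" },
  { name: "sub-01", children: [{ name: "anat", children: [{ name: "sub-01_T1w.nii.gz" }] }] },
  { name: "sub-02", children: [{ name: "anat", children: [{ name: "sub-02_T1w.nii.gz" }] }] },
]

console.log(countQueries(tree)) // 1 root listing + 4 directories = 5 queries
```

Even this tiny tree needs 5 queries; the query count grows linearly with the number of directories, which is what the tables below quantify.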

Each directory in the dataset tree requires its own GraphQL query. For real-world BIDS datasets, this adds up fast:

| Dataset | Subjects | Total files | Est. directories | GraphQL queries needed | With working downloadFiles |
|---|---|---|---|---|---|
| ds000001 | 16 | 133 | ~49 | ~50 | 1 |
| ds000117 | 17 | 1,156 | ~355 | ~355 | 1 |
| ds000228 | 155 | 626 | ~622 | ~622 | 1 |
| ds002345 | 345 | 4,164 | ~1,037 | ~1,037 | 1 |

At ~200 ms per round-trip, even with 10-way parallel fetching (as implemented in openneuro-py PR #289), metadata enumeration for ds002345 takes on the order of 20 seconds before any file download can begin. Without parallelization, it would be ~3.5 minutes of pure metadata queries.

Parallelization also can't fully solve this: children aren't discoverable until their parent directory has been queried. For deep trees like FreeSurfer derivatives (depth 6: derivatives/freesurfer/sub-XX/ses-mri/anat/label/), that's a minimum of 6 sequential round-trips before leaf files are even known. The tree must be traversed level by level.
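The sequential lower bound can be stated as a one-liner: each directory component of a path costs one round-trip before its children become discoverable, so the minimum number of sequential round-trips equals the directory depth of the deepest file. A minimal sketch (the helper name is hypothetical; the path is the one from the issue text):

```typescript
// Minimum sequential round-trips to discover files under a given
// directory path, assuming one files(tree: ...) query per directory level.
function minSequentialRoundTrips(dirPath: string): number {
  return dirPath.split("/").filter(Boolean).length
}

const leaf = "derivatives/freesurfer/sub-XX/ses-mri/anat/label"
console.log(minSequentialRoundTrips(leaf)) // 6
```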

Server-side impact

This isn't just a client-side inconvenience. Every openneuro-py download of ds002345 sends ~1,037 GraphQL requests to the OpenNeuro API. Multiply that by the number of users and CI jobs downloading datasets, and the N+1 pattern generates significant unnecessary load on the API server. A working downloadFiles would reduce this to 1 request per download — an improvement by orders of magnitude for both clients and the server.

Suggestions

Option A: Fix downloadFiles

The per-directory files(tree: ...) endpoint already returns working proxy URLs (https://openneuro.org/crn/...) when S3 URLs aren't available. downloadFiles should do the same — return the recursive listing with whatever URLs are available, rather than silently returning only root-level files when the S3 export hasn't completed. The recursive traversal code in getFilesRecursive already exists and works; the issue is purely in the doNotCache guard rejecting results that lack S3 URLs.
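One possible shape for this fix, sketched below. This is hypothetical and only mirrors the guard snippet quoted above, not the full `datalad/files.ts` source; the function names and the 300-second TTL are assumptions. The idea: cache indefinitely only once every file has an S3 URL, but still return and briefly cache the recursive listing (with proxy URLs) in the meantime, instead of discarding it.

```typescript
interface SnapshotFile {
  directory: boolean
  urls: string[]
}

// Cache forever only once every non-directory entry has an S3 URL,
// i.e. the snapshot export has completed.
function shouldCacheIndefinitely(files: SnapshotFile[]): boolean {
  return files.every(
    (f) => f.directory || f.urls.some((u) => u.includes("s3.amazonaws.com")),
  )
}

// Otherwise still cache the recursive listing, but with a short TTL
// (hypothetical value) so proxy URLs are served until the export finishes.
function cacheTtlSeconds(files: SnapshotFile[]): number {
  return shouldCacheIndefinitely(files) ? 0 /* no expiry */ : 300
}
```

With this approach, `downloadFiles` degrades the same way `files(tree: ...)` already does: clients get proxy URLs immediately and S3 URLs once the export lands.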

Option B: Add a recursive argument to files

```graphql
files(tree: String, recursive: Boolean): [DatasetFile]
```

This would let clients opt into a flat recursive listing without relying on the separate downloadFiles field.

Why not use the git endpoint?

The official JS CLI uses isomorphic-git to clone datasets, sidestepping the GraphQL file-listing API entirely. However, for third-party tools like openneuro-py, the GraphQL API is the better fit:

  1. Single-step download experience. openneuro-py downloads data files directly via the URLs provided by the GraphQL API. By contrast, the JS CLI is a two-step process by design: it clones the git tree, then tells users to separately run datalad get or git-annex get to fetch the actual data files.

  2. No git infrastructure dependency. Users of openneuro-py don't need git, git-annex, or DataLad installed — just uv add openneuro-py / uvx openneuro-py. This matters especially in environments like HPC clusters or CI pipelines where installing git-annex may not be straightforward.

  3. The GraphQL API is the documented programmatic interface. The files and downloadFiles fields exist specifically for this use case. Since the JS CLI sidesteps these code paths entirely, regressions like this broken cache go undetected — there's no internal consumer exercising the feature.

Do you have any interest in helping implement the feature?

Yes, but I would need guidance on the caching architecture in datalad/files.ts.

Additional information / screenshots

No response.

cc @larsoner
