Hi, and thank you for the excellent work on this package!
I've been using StreamingDataset for large-scale training with a dataset stored on S3, and I've encountered two major limitations that I couldn't find addressed in the documentation. I’d appreciate clarification or guidance on best practices, and I'd like to propose feature support if these aren't currently handled.
📦 Use Case
We’re working with a very large dataset (~100 TB) stored in an S3-compatible object storage system. The dataset is preprocessed into MDS format and hosted in a shared S3 bucket. Each sample includes metadata (e.g., age, sex), and users often want to:
- Filter the dataset based on metadata (e.g., only use female patients over age 60)
- Split the dataset into train/val/test subsets using arbitrary logic
This is a common scenario in medical and scientific domains where researchers need flexible subsets from a shared, centralized dataset.
❗ Current Limitation 1: `Subset` behavior is unclear and incompatible with DDP
For single-GPU training, we’ve been able to use `torch.utils.data.Subset(streaming_dataset, indices)` to split or filter the dataset, and it seems to work as expected.
However, this usage is not documented anywhere, and I want to confirm:
- Is using `Subset` on a `StreamingDataset` officially supported?
- Are there caveats around `__len__`, shuffling, or streaming behavior when using `Subset` this way?
- Will it break prefetching or caching mechanisms internally?
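For concreteness, here is a minimal sketch of the single-GPU pattern described above. A plain list of dicts stands in for the real `StreamingDataset` so the `Subset` semantics stay visible and runnable; with `StreamingDataset` the wrapping is identical, but the indices refer to whatever view the dataset exposes to the current process. The class and sample values are illustrative, not from the original report.

```python
from torch.utils.data import Dataset, Subset

class ToyMetadataDataset(Dataset):
    """Stand-in for an MDS-backed dataset whose samples carry metadata."""
    def __init__(self, samples):
        self.samples = samples
    def __len__(self):
        return len(self.samples)
    def __getitem__(self, i):
        return self.samples[i]

samples = [{"age": a, "sex": s} for a, s in
           [(45, "F"), (62, "F"), (70, "M"), (66, "F")]]
ds = ToyMetadataDataset(samples)

# Metadata filter: female patients over age 60 (the example from the text).
keep = [i for i in range(len(ds))
        if ds[i]["sex"] == "F" and ds[i]["age"] > 60]
subset = Subset(ds, keep)  # selects original indices 1 and 3
```

Note that building `keep` requires reading each sample's metadata, which against a streaming dataset triggers shard downloads; at 100 TB scale the metadata would need to live in a separate index.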
For multi-GPU training (DDP), the problem becomes more serious:
As soon as the `StreamingDataset` is initialized, it automatically shards the dataset across processes based on `RANK`, `WORLD_SIZE`, etc. This means:
- Any filtering or splitting logic (e.g., `Subset`) only applies to that local shard
- The user has no way to enforce a consistent global split across ranks
- It becomes impossible to implement deterministic train/val/test splits or metadata-based filtering unless the dataset is physically reprocessed into new shard directories (which is infeasible at 100 TB scale)
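The mismatch can be made concrete with a few lines of pure Python. The round-robin partition below is only a stand-in for the library's actual (more involved) sharding scheme, but it shows why the same `Subset` index refers to different global samples on different ranks:

```python
# Simulate two DDP ranks over 8 global samples.
def local_view(global_samples, rank, world_size):
    # Round-robin partition, standing in for the dataset's real sharding.
    return global_samples[rank::world_size]

global_samples = list(range(8))
rank0 = local_view(global_samples, 0, 2)  # rank 0 sees [0, 2, 4, 6]
rank1 = local_view(global_samples, 1, 2)  # rank 1 sees [1, 3, 5, 7]

# Subset indices (0, 1) select *different* global samples on each rank,
# so any index-based split applied after sharding is globally inconsistent.
picked0 = [rank0[i] for i in (0, 1)]  # global samples 0 and 2
picked1 = [rank1[i] for i in (0, 1)]  # global samples 1 and 3
```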
We would benefit greatly from:
- A mechanism to restrict the global sample list before DDP sharding occurs, such as a `split()` method or `partition_index` constructor argument that defines a subset of the dataset globally
- Or a way to defer sharding so that the user can apply their own filtering or sampling logic prior to partitioning for distributed training
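Until something like this exists, one workaround sketch is to make the split a pure function of a stable per-sample identifier (e.g., the global index or a patient ID), so every rank and every job computes the identical assignment with no cross-process communication. The helper below is an assumption-laden illustration, not a `streaming` API:

```python
import hashlib

def assign_split(sample_id: str, seed: str = "v1",
                 val_frac: float = 0.1, test_frac: float = 0.1) -> str:
    """Deterministically map a sample to 'train', 'val', or 'test'.

    Hashing (seed, sample_id) gives a reproducible pseudo-uniform value,
    so the split is globally consistent across ranks and jobs.
    """
    h = hashlib.sha256(f"{seed}:{sample_id}".encode()).digest()
    u = int.from_bytes(h[:8], "big") / 2**64  # uniform in [0, 1)
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "val"
    return "train"
```

The catch remains that applying this after `StreamingDataset` construction still only filters the rank-local shard, which is exactly why a pre-sharding hook in the library would help.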
❗ Current Limitation 2: Shared local cache not supported
In our cluster setup, multiple users and jobs need to train on the same dataset. To save bandwidth and storage, we want to share the same local cache directory (e.g., `/scratch/mds_cache`) between users or jobs.
Currently:
- Each process or user creates its own local cache path, leading to duplicated downloads
- Reusing the same path leads to potential race conditions, errors, or inconsistent states
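To illustrate the kind of coordination being requested (not something the library does today), here is a sketch of advisory file locking around shard downloads, so concurrent jobs sharing one cache directory download each shard at most once. `fcntl` makes this POSIX-only; the function names are hypothetical:

```python
import fcntl
import os
from contextlib import contextmanager

@contextmanager
def shard_lock(cache_dir: str, shard_name: str):
    """Hold an exclusive advisory lock for one shard in a shared cache."""
    os.makedirs(cache_dir, exist_ok=True)
    lock_path = os.path.join(cache_dir, shard_name + ".lock")
    with open(lock_path, "w") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            yield
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)

def ensure_shard(cache_dir, shard_name, download):
    """Download a shard only if no other job has already cached it."""
    path = os.path.join(cache_dir, shard_name)
    with shard_lock(cache_dir, shard_name):
        if not os.path.exists(path):
            download(path)  # user-supplied fetch, e.g. from S3
    return path
```

A real implementation would also need to handle crashed holders (stale lock files) and cross-node filesystems where advisory locks may not propagate.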
We would love to see support for:
- A safe shared cache mode, possibly with options for:
  - Read-only access
  - Locking mechanisms
  - Checksum validation for partially downloaded files

This would make streaming much more scalable for large shared environments.
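For the checksum-validation point, the shape of the check is simple; a sketch like the following (hypothetical helper names, expected digests assumed to come from the dataset's index) could let a job distinguish a fully cached shard from a truncated one left by a crashed download:

```python
import hashlib

def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 in chunks (shards can be large)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def is_valid_shard(path: str, expected_hex: str) -> bool:
    """True only if the file exists and matches the expected digest,
    so partially downloaded shards are treated as missing."""
    try:
        return file_sha256(path) == expected_hex
    except FileNotFoundError:
        return False
```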