Open
Labels: enhancement (New feature or request)
Description
🚀 Feature Request
I would like the ability to load only certain key/value pairs from the sample dict of a StreamingDataset. The keys to load would be specified when the object is initialized.
Motivation
I have a dataset with many fields, which supports many of our users' different training and testing needs. However, each user typically uses only a smallish subset of the fields; the remainder are loaded and never used. To streamline data loading, it seems I would have to subselect the data and save it to separate datasets, one per use case, which is difficult to manage, prone to errors, and likely to introduce some storage redundancy. It would be better to load only what is needed from each shard.
[Optional] Implementation
I can think of three ways to do this:
- Load each datapoint as is done currently and discard the fields that were not requested. This provides the requested functionality but saves no load time, so it is not a serious option.
- Save each field to a separate dataset, then perform coordinated reads and stitch the results together. I suppose I could do this myself if it were easy to coordinate the random reads across the per-field datasets.
- Ask here for this functionality to be added to the package. If the design of the shard files does not easily support this modification, this is a dead end.
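
To make the first two options concrete, here is a minimal sketch of what I mean. It is not part of the package: `FieldSubset` and `StitchedFields` are hypothetical names, and I am assuming only that the underlying datasets are map-style (they support `len()` and integer indexing) and that each sample is a dict. It does not address how shuffling would be kept in sync across per-field datasets, which is the hard part of the second option.

```python
from typing import Any, Dict, Iterable, Mapping, Sequence


class FieldSubset:
    """Option 1 (hypothetical helper): keep only the requested keys.

    The full sample is still read and decoded, so this saves no load time.
    """

    def __init__(self, dataset: Sequence[Mapping[str, Any]], fields: Iterable[str]):
        self.dataset = dataset
        self.fields = tuple(fields)

    def __len__(self) -> int:
        return len(self.dataset)

    def __getitem__(self, idx: int) -> Dict[str, Any]:
        sample = self.dataset[idx]  # every field is loaded here
        return {key: sample[key] for key in self.fields}


class StitchedFields:
    """Option 2 (hypothetical helper): one dataset per field, merged by index."""

    def __init__(self, per_field: Mapping[str, Sequence[Any]]):
        lengths = {len(ds) for ds in per_field.values()}
        if len(lengths) != 1:
            raise ValueError("all per-field datasets must have the same length")
        self.per_field = dict(per_field)
        self._length = lengths.pop()

    def __len__(self) -> int:
        return self._length

    def __getitem__(self, idx: int) -> Dict[str, Any]:
        # Only the datasets for the requested fields are touched.
        return {name: ds[idx] for name, ds in self.per_field.items()}


if __name__ == "__main__":
    # Plain lists stand in for real datasets in this example.
    full = [
        {"image": 1, "label": 0, "caption": "a"},
        {"image": 2, "label": 1, "caption": "b"},
    ]
    print(FieldSubset(full, ["image", "label"])[1])
    print(StitchedFields({"image": [1, 2], "label": [0, 1]})[0])
```

With the second helper, a user would pass in only the per-field datasets they need, so unused fields are never read. The cost is the extra bookkeeping of one dataset per field, which is what I would like to avoid.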
Additional context