🚀 Feature Request
Hey MosaicML team! Thank you so much for this awesome project! I was wondering if there are any plans to make it framework-agnostic by removing the dependency on PyTorch.
Motivation
The general idea of StreamingDataset is very useful, and I believe the wider ML community would benefit if it were decoupled from PyTorch.
Implementation
Here are my thoughts on how we can go about this:
- torch.utils.data.Dataset is a simple class with no dependencies on the rest of PyTorch (the same is true of IterableDataset), so it can easily be re-implemented here.
- Porting the distributed.py file is more challenging, but this is where the CuPy project comes to the rescue. CuPy, JAX, TensorFlow, and PyTorch tensors can be exchanged with no copies via the DLPack API, and most of the functions in distributed.py have similar implementations in CuPy's distributed API.
- StreamingDataLoader can be an optional install that is available only when installing with the PyTorch backend.
- So my suggestion is to use CuPy instead of PyTorch: this keeps the framework neutral and gives zero-copy interoperability between JAX, TF, and Torch.
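To illustrate the first point, here is a minimal sketch of what dependency-free base classes could look like. It is only an illustration of the idea, not the project's actual design: the class bodies are my own simplification, and RangeStream is a hypothetical stand-in for a real streaming dataset.

```python
from typing import Any, Iterator


class Dataset:
    """Map-style dataset: subclasses provide __getitem__ and __len__."""

    def __getitem__(self, index: int) -> Any:
        raise NotImplementedError

    def __len__(self) -> int:
        raise NotImplementedError


class IterableDataset:
    """Iterable-style dataset: subclasses provide __iter__."""

    def __iter__(self) -> Iterator[Any]:
        raise NotImplementedError


class RangeStream(IterableDataset):
    """Hypothetical stand-in for a streaming dataset; yields integers."""

    def __init__(self, size: int) -> None:
        self.size = size

    def __iter__(self) -> Iterator[int]:
        # A real implementation would stream samples from shards here.
        yield from range(self.size)


samples = list(RangeStream(4))
```

Nothing in this sketch imports a deep learning framework, so the same base classes could sit underneath a PyTorch, JAX, or TensorFlow front end.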
Additional context
If made framework-agnostic:
- This can be used with tf.data pipelines, which work well with JAX and TensorFlow.
- It fits naturally into keras.utils.Sequence, so it can also be used with Keras 3, which is compatible with TF/JAX/PyTorch backends.
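As a rough sketch of the Keras point, the adapter below follows the keras.utils.Sequence protocol, where __len__ returns the number of batches and __getitem__ returns the batch at a given index. BatchedSequence is a hypothetical name, it deliberately does not subclass keras.utils.Sequence so that the example stays dependency-free, and each batch is a plain list rather than the (inputs, targets) pair a real model would consume.

```python
import math
from typing import Any, List, Sequence


class BatchedSequence:
    """Hypothetical adapter exposing the keras.utils.Sequence protocol."""

    def __init__(self, samples: Sequence[Any], batch_size: int) -> None:
        self.samples = samples
        self.batch_size = batch_size

    def __len__(self) -> int:
        # Number of batches, counting a final partial batch.
        return math.ceil(len(self.samples) / self.batch_size)

    def __getitem__(self, index: int) -> List[Any]:
        if not 0 <= index < len(self):
            raise IndexError(index)
        start = index * self.batch_size
        return list(self.samples[start:start + self.batch_size])


batches = BatchedSequence(list(range(5)), batch_size=2)
all_batches = [batches[i] for i in range(len(batches))]
```

In an actual integration the samples argument would be a map-style dataset rather than an in-memory list, and the class would inherit from keras.utils.Sequence.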
I would also be happy to help with this if you think it is a promising future direction!