Releases: mosaicml/streaming
v0.13.0
What's Changed
- Fix typing by @dakinggg in #907
- Alternative authentication to Azure services by @erayinanc in #904
- Allow Spark BinaryType to map to binary-encoded MDS types like PNG, if user specifies so by @srowen in #913
- [Bug Fix] HfUploader - use the right local filename path. by @Abhinay1997 in #874
New Contributors
- @erayinanc made their first contribution in #904
- @Abhinay1997 made their first contribution in #874
Full Changelog: v0.12.0...v0.13.0
v0.12.0
What's New
1. Python 3.12 Bump (#894)
We've added support for Python 3.12 and deprecated Python 3.9 support.
What's Changed
- Bump main 0.11.0.dev0 -> 0.12.0.dev0 by @es94129 in #862
- [Fix issue 415] Fallback to In Memory Encoding for JPEG Constructed from Byte Streams by @XiaohanZhangCMU in #878
- Add Jpeg Array to MDS encoding list by @XiaohanZhangCMU in #881
- added list support for images by @ethantang-db in #882
- Bump fastapi from 0.115.6 to 0.115.12 by @dependabot in #886
- Bump pydantic from 2.10.5 to 2.10.6 by @dependabot in #866
- Update setuptools requirement from <76.0.0 to <79.0.0 by @dependabot in #891
- Bump databricks-sdk from 0.29.0 to 0.49.0 by @dependabot in #890
- Bump yamllint from 1.35.1 to 1.37.0 by @dependabot in #888
- Bump pytest from 8.3.4 to 8.3.5 by @dependabot in #887
- Update huggingface-hub requirement from <0.28,>=0.23.4 to >=0.23.4,<0.30 by @dependabot in #889
- Fix dangling file handler in Stream.get_shards() by @aadyotb in #892
- Bump Python 3.12 by @KuuCi in #894
Full Changelog: v0.11.0...v0.12.0
v0.11.0
🚀 Streaming v0.11.0
Streaming v0.11.0 is released! Install via pip:
pip install --upgrade mosaicml-streaming==0.11.0
What's new
1. Introducing registry for customizable components (#858)
StreamingDataset can now be used with custom Stream implementations via a registry. See the documentation page for example usage.
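The registry idea can be sketched in plain Python. This is not the library's API: STREAM_REGISTRY, register_stream, build_stream, and both stream classes are hypothetical names, used only to show how a name-to-class lookup lets a dataset construct a custom stream implementation.

```python
from typing import Callable, Dict

# Hypothetical registry: maps a string name to a stream class.
STREAM_REGISTRY: Dict[str, type] = {}


def register_stream(name: str) -> Callable[[type], type]:
    """Class decorator that records a stream class under `name`."""
    def decorator(cls: type) -> type:
        if name in STREAM_REGISTRY:
            raise ValueError(f'Stream name already registered: {name}')
        STREAM_REGISTRY[name] = cls
        return cls
    return decorator


@register_stream('default')
class BasicStream:
    def __init__(self, remote: str, local: str) -> None:
        self.remote = remote
        self.local = local


@register_stream('uppercase_local')
class UppercaseLocalStream(BasicStream):
    """Illustrative custom stream that normalizes its local path."""
    def __init__(self, remote: str, local: str) -> None:
        super().__init__(remote, local.upper())


def build_stream(name: str, **kwargs) -> BasicStream:
    """Look up the class by name and construct it with the given kwargs."""
    try:
        cls = STREAM_REGISTRY[name]
    except KeyError:
        raise KeyError(f'Unknown stream name: {name}') from None
    return cls(**kwargs)


stream = build_stream('uppercase_local', remote='s3://bucket/data', local='/tmp/cache')
print(type(stream).__name__, stream.local)
```

The caller selects an implementation by name, so new stream types can be added without changing the code that constructs them.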
🐛 Bug fixes
- Fix simulation module import paths (@srstevenson)
- Fix S3Downloader serialization issues (@wouterzwerink)
What's Changed
- Bound numpy version below 2.2.0 by @snarayan21 in #849
- Fix import paths in simulation module by @srstevenson in #838
- Prevent _s3_client from being serialized by @wouterzwerink in #847
- Fix a few typos by @srstevenson in #843
- Change broken user guide link to quick start by @srstevenson in #841
- Remove unused import from quick start example by @srstevenson in #842
- Change simulator UI help text to refer to directory by @srstevenson in #839
- Bump fastapi from 0.115.5 to 0.115.6 by @dependabot in #845
- Bump pydantic from 2.10.2 to 2.10.3 by @dependabot in #846
- Update mosaicml-cli requirement from <0.7,>=0.5.25 to >=0.5.25,<0.8 by @dependabot in #850
- Bump uvicorn from 0.32.1 to 0.34.0 by @dependabot in #855
- Bump pydantic from 2.10.3 to 2.10.4 by @dependabot in #856
- Update huggingface-hub requirement from <0.27,>=0.23.4 to >=0.23.4,<0.28 by @dependabot in #859
- Set epoch_seed_change attribute on SimulationDataset by @srstevenson in #840
- Use registry when creating Stream in StreamingDataset by @es94129 in #858
- Bump pydantic from 2.10.4 to 2.10.5 by @dependabot in #861
New Contributors
- @srstevenson made their first contribution in #838
- @wouterzwerink made their first contribution in #847
- @es94129 made their first contribution in #858
Full Changelog: v0.10.0...v0.11.0
v0.10.0
🚀 Streaming v0.10.0
Streaming v0.10.0 is released! Install via pip:
pip install --upgrade mosaicml-streaming==0.10.0
Improvements
1. Reusable cloud download clients (#817)
- Streaming now reuses cloud download clients when downloading shard files instead of creating a new client for each download.
- This avoids run failures that sometimes occur with too many open sockets or excessive cloud authentication requests.
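A minimal sketch of the client-reuse pattern, assuming one client per URL scheme is enough. FakeClient and ClientCache are hypothetical stand-ins, not Streaming's downloader classes; a real client would wrap a cloud SDK session.

```python
from typing import Callable, Dict


class FakeClient:
    """Stand-in for a cloud SDK client; counts how many were constructed."""
    created = 0

    def __init__(self, scheme: str) -> None:
        FakeClient.created += 1
        self.scheme = scheme

    def download(self, remote: str, local: str) -> str:
        return f'{remote} -> {local}'


class ClientCache:
    """Create one client per URL scheme and reuse it across downloads."""

    def __init__(self, factory: Callable[[str], FakeClient]) -> None:
        self._factory = factory
        self._clients: Dict[str, FakeClient] = {}

    def get(self, remote: str) -> FakeClient:
        scheme = remote.split('://', 1)[0]
        if scheme not in self._clients:
            self._clients[scheme] = self._factory(scheme)
        return self._clients[scheme]


cache = ClientCache(FakeClient)
for i in range(100):
    remote = f's3://bucket/shard.{i:05}.mds'
    cache.get(remote).download(remote, f'/tmp/shard.{i:05}.mds')

print(FakeClient.created)  # One client served all 100 downloads.
```

Constructing the client once keeps the number of open connections and authentication requests independent of the number of shards.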
2. py1b shuffle algorithm deprecation (#837)
- The py1b shuffle algorithm has now been deprecated. Please use the improved py1e (default) or the py1br shuffle algorithms instead.
What's Changed
- Update FAQs to indicate wrapping not supported by @milocress in #822
- refactored the download module to have reusable clients by @ethantang-db in #817
- Update pytest-cov requirement from <6,>=4 to >=4,<7 by @dependabot in #821
- Consistent errors for unused streams in batching methods by @snarayan21 in #826
- Update setuptools requirement from <68.0.0 to <76.0.0 by @dependabot in #825
- fix f string by @XiaohanZhangCMU in #829
- Bump fastapi from 0.115.4 to 0.115.5 by @dependabot in #830
- Bump uvicorn from 0.32.0 to 0.32.1 by @dependabot in #834
- Bump pydantic from 2.9.2 to 2.10.1 by @dependabot in #833
- Bump pytest from 8.3.3 to 8.3.4 by @dependabot in #836
- Bump pydantic from 2.10.1 to 2.10.2 by @dependabot in #835
- Version bump to 0.11.0.dev0, including deprecations by @snarayan21 in #837
New Contributors
- @ethantang-db made their first contribution in #817
Full Changelog: v0.9.1...v0.10.0
v0.9.1
🚀 Streaming v0.9.1
Streaming v0.9.1 is released! Install via pip:
pip install --upgrade mosaicml-streaming==0.9.1
What's New
1. Streaming is added to Gurubase (#805)
- Streaming now has an AI assistant available to help users with their questions! Try out Streaming Guru, which uses an LLM together with data from this repo and the docs to answer questions.
Improvements
1. Permission Issue Resolution (#813)
- Resolved read permission issues occurring when shared memory files are created in shared computing environments. We added retry conditions to allow the creation of new shared memory files upon encountering permission errors.
- Prefix Integrity for Shared Memory Files: When creating shared memory files, both LOCALS and FILELOCKS are now validated to ensure no overlap with existing files, and they are matched with consistent prefix identifiers.
- Handling Non-Normal Program Exits: Enhanced cleanup procedures to address cases where non-normal program exits left some shared memory files uncleared. All files in SHM_TO_CLEAN are now checked to prevent duplicates.
These changes improve shared memory management and reliability in shared environments.
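The retry behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: create_with_retry, the callable it takes, and the zero-padded prefix naming are all hypothetical.

```python
from typing import Callable, Set


def create_with_retry(create: Callable[[str], None], max_attempts: int = 10) -> str:
    """Try successive prefixes until `create` succeeds.

    `create` is a hypothetical callable that makes the shared memory file
    for a given name and raises PermissionError or FileExistsError when
    that name belongs to another user or run.
    """
    for prefix_int in range(max_attempts):
        name = f'{prefix_int:06}_locals'
        try:
            create(name)
            return name
        except (PermissionError, FileExistsError):
            continue  # Name is taken or unreadable; move to the next prefix.
    raise RuntimeError(f'No usable prefix found in {max_attempts} attempts')


# Simulated environment: another user owns the first two prefixes.
owned_by_others: Set[str] = {'000000_locals', '000001_locals'}


def fake_create(name: str) -> None:
    if name in owned_by_others:
        raise PermissionError(name)


print(create_with_retry(fake_create))
```

Treating a permission error like a name collision lets a job in a shared pod skip files left by other users instead of failing.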
2. Fix Shard Eviction Hanging (#795)
- Changed the search for coldest shard to avoid looping over remote shards by considering local shards only as possible candidates for eviction.
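As a rough sketch of the idea, restricting the candidates to local shards guarantees the search either returns something evictable or stops. ShardInfo and coldest_local_shard are hypothetical names, and the last-access timestamp is an assumed measure of coldness.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ShardInfo:
    shard_id: int
    is_local: bool       # True if the shard file is present on disk.
    last_access: float   # Timestamp of the most recent access.


def coldest_local_shard(shards: List[ShardInfo]) -> Optional[int]:
    """Return the id of the least recently used local shard, or None."""
    local = [s for s in shards if s.is_local]
    if not local:
        return None  # Nothing on disk to evict, so report that and stop.
    return min(local, key=lambda s: s.last_access).shard_id


shards = [
    ShardInfo(0, is_local=False, last_access=1.0),  # Remote: never a candidate.
    ShardInfo(1, is_local=True, last_access=5.0),
    ShardInfo(2, is_local=True, last_access=3.0),
]
print(coldest_local_shard(shards))
```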
What's Changed
- Bump pydantic from 2.9.1 to 2.9.2 by @dependabot in #785
- Bump fastapi from 0.114.2 to 0.115.0 by @dependabot in #786
- Bump uvicorn from 0.30.6 to 0.31.0 by @dependabot in #793
- Fixed broken links in README.md by @LukaszSztukiewicz in #794
- Shard evict fix by @snarayan21 in #795
- Update huggingface-hub requirement from <0.25,>=0.23.4 to >=0.23.4,<0.26 by @dependabot in #787
- Fix dataset.size() typo in docs by @snarayan21 in #798
- Warning -> info about defaults from v0.7.0 by @snarayan21 in #799
- Bump uvicorn from 0.31.0 to 0.31.1 by @dependabot in #803
- Bump fastapi from 0.115.0 to 0.115.2 by @dependabot in #804
- Introducing Streaming Guru on Gurubase.io by @kursataktas in #805
- Add better error message for shared prefix by @XiaohanZhangCMU in #806
- Bump uvicorn from 0.31.1 to 0.32.0 by @dependabot in #809
- Bump pytest-split from 0.9.0 to 0.10.0 by @dependabot in #810
- Fix logo png by @XiaohanZhangCMU in #808
- Update huggingface-hub requirement from <0.26,>=0.23.4 to >=0.23.4,<0.27 by @dependabot in #814
- Bump fastapi from 0.115.2 to 0.115.4 by @dependabot in #815
- Fix shared memory permission issue in a shared pod environment by @XiaohanZhangCMU in #813
New Contributors
- @LukaszSztukiewicz made their first contribution in #794
- @kursataktas made their first contribution in #805
Full Changelog: v0.9.0...v0.9.1
v0.9.0
🚀 Streaming v0.9.0
Streaming v0.9.0 is released! Install via pip:
pip install --upgrade mosaicml-streaming==0.9.0
What's New
1. Improved compatibility for ndarray and json types (#776, #777)
Columns that include a map type now convert successfully to JSON in an MDS file when the column's type is specified as 'json', and the JSON encoder now handles ndarray types.
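One common way to make a JSON encoder accept array values is to convert them to lists in a JSONEncoder subclass. The sketch below is not Streaming's encoder: ArrayFriendlyEncoder is a hypothetical name, and the stdlib array type stands in for numpy.ndarray (both expose tolist()) so the example has no dependencies.

```python
import json
from array import array
from typing import Any


class ArrayFriendlyEncoder(json.JSONEncoder):
    """Convert array-like values to lists before JSON serialization."""

    def default(self, o: Any) -> Any:
        # numpy.ndarray exposes tolist(); so does the stdlib array type
        # used here to keep the sketch dependency-free.
        if hasattr(o, 'tolist'):
            return o.tolist()
        return super().default(o)


sample = {'tokens': array('i', [1, 2, 3]), 'meta': {'lang': 'en'}}
print(json.dumps(sample, cls=ArrayFriendlyEncoder))
```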
What's Changed
- Bump fastapi from 0.112.1 to 0.112.2 by @dependabot in #768
- Bump ci testing by @snarayan21 in #770
- Bump jupyter from 1.0.0 to 1.1.1 by @dependabot in #772
- Bump fastapi from 0.112.2 to 0.114.0 by @dependabot in #779
- Bump pydantic from 2.8.2 to 2.9.1 by @dependabot in #778
- Allow JSON encoder to handle ndarray by @srowen in #777
- Add MapType as JSON-compatible by @srowen in #776
- Bump fastapi from 0.114.0 to 0.114.2 by @dependabot in #783
- Update datasets requirement from <3,>=2.4.0 to >=2.4.0,<4 by @dependabot in #784
- Bump pytest from 8.3.2 to 8.3.3 by @dependabot in #782
- Bump main branch to 0.10.0.dev0 by @dakinggg in #790
Full Changelog: v0.8.1...v0.9.0
v0.8.1
🚀 Streaming v0.8.1
Streaming v0.8.1 is released! Install via pip:
pip install --upgrade mosaicml-streaming==0.8.1
🔧 Improvements
Dataloader hanging between epochs has now been resolved! We've seen training time improvements of up to 40% for some many-epoch training jobs. If this was impacting your runs and has now been fixed, please let us know!
- Fix dataloader hang at the end of an epoch by @XiaohanZhangCMU in #741
- Add default compression, and warning about local paths to dataframe_to_mds by @srowen in #748
- Throw exception when event.is_set() after write()s by @srowen in #754
🐛 Bug Fixes
- Ensure deterministic sample order between epochs when shuffle=False by @snarayan21 in #750
What's Changed
- Make Pytest log in color in Github Action by @eitanturok in #739
- fix azure container name and blob name in download_from_azure by @jaehwana2z in #733
- Bump uvicorn from 0.30.3 to 0.30.5 by @dependabot in #743
- Update huggingface-hub requirement from <0.24,>=0.23.4 to >=0.23.4,<0.25 by @dependabot in #729
- Bump fastapi from 0.111.1 to 0.112.0 by @dependabot in #744
- Bump ci-testing to v0.1.0 by @snarayan21 in #745
- Patching conf.py due to Sphinx deprecating config manipulation by @snarayan21 in #746
- Bump ci-testing to v0.1.2 by @snarayan21 in #747
- Type hints conformant with pep 585 by @snarayan21 in #752
- Ruff rule to remove unused imports by @snarayan21 in #756
- Fix linting for numpy 2.1.0 by @snarayan21 in #764
- Bump fastapi from 0.112.0 to 0.112.1 by @dependabot in #760
- Bump uvicorn from 0.30.5 to 0.30.6 by @dependabot in #762
- Version 0.8.1 bump! by @snarayan21 in #766
New Contributors
- @eitanturok made their first contribution in #739
- @jaehwana2z made their first contribution in #733
- @srowen made their first contribution in #748
Full Changelog: v0.8.0...v0.8.1
v0.8.0
✨ What's New ✨
1. HF File System Streaming (#711)
Streaming now supports streaming data from the Hugging Face (HF) file system! This adds another popular backend as an option to host your data.
What's Changed
- Bump fastapi from 0.110.2 to 0.111.0 by @dependabot in #670
- Fix: having zero bytes files after converting spark dataframe to MDS saved on dbfs:/Volumes by @XiaohanZhangCMU in #668
- Ensure shards cannot be larger than 4GB by @snarayan21 in #672
- Helpful error on py1e for improperly written datasets by @snarayan21 in #673
- Bump pytest from 8.2.0 to 8.2.1 by @dependabot in #680
- Update platform references by @aspfohl in #675
- Update CODEOWNERS by @karan6181 in #681
- Fix batch_size typo for Stream object in docs by @snarayan21 in #682
- Bump databricks-sdk from 0.27.0 to 0.27.1 by @dependabot in #679
- Improve local temp directory error when only remote is specified by @snarayan21 in #683
- Fix node calculation in replication for World object by @snarayan21 in #685
- Warning condition changed for Sequence Parallelism by @XiaohanZhangCMU in #688
- Bump pydantic from 2.7.1 to 2.7.2 by @dependabot in #692
- Bump uvicorn from 0.29.0 to 0.30.1 by @dependabot in #691
- Make sure epoch_size is an int by @snarayan21 in #693
- Bump databricks-sdk from 0.27.1 to 0.28.0 by @dependabot in #687
- Bump pytest from 8.2.1 to 8.2.2 by @dependabot in #697
- fix: expand user path for Writer's output directory. by @huxuan in #694
- Bump pydantic from 2.7.2 to 2.7.3 by @dependabot in #696
- Fix edge cases with scalar or empty numpy array encoding by @snarayan21 in #702
- Raise IndexError in Spanner object instead of ValueError by @snarayan21 in #701
- Fix linting issues with numpy 2 by @snarayan21 in #705
- Bump pydantic from 2.7.3 to 2.7.4 by @dependabot in #704
- Enable correct resumption from the end of an epoch by @snarayan21 in #700
- Fix drop_first checking in partitioning to account for world_size divisibility by @snarayan21 in #706
- fix convert imagenet by @Hprairie in #708
- Bump pytest-split from 0.8.2 to 0.9.0 by @dependabot in #710
- Remove duplicate dbfs: prefix from error message by @vanshcsingh in #712
- enable adaptive retry for s3 download by @bigning in #713
- Upgrade ci_testing, remove codeql by @snarayan21 in #714
- Fix Linting from Pillow version update by @XiaohanZhangCMU in #719
- Bump pydantic from 2.7.4 to 2.8.2 by @dependabot in #718
- Bump databricks-sdk from 0.28.0 to 0.29.0 by @dependabot in #715
- Add HF File System Support to Streaming by @orionw in #711
- Improve error message on non-0 rank when index file download failed by @bigning in #723
- Bump pytest from 8.2.2 to 8.3.2 by @dependabot in #735
- Bump uvicorn from 0.30.1 to 0.30.3 by @dependabot in #730
- Bump fastapi from 0.111.0 to 0.111.1 by @dependabot in #724
- Bump Streaming Version to 0.8.0 by @mvpatel2000 in #738
New Contributors
- @aspfohl made their first contribution in #675
- @huxuan made their first contribution in #694
- @Hprairie made their first contribution in #708
- @vanshcsingh made their first contribution in #712
- @orionw made their first contribution in #711
Full Changelog: v0.7.6...v0.8.0
v0.7.6
🚀 Streaming v0.7.6
Streaming v0.7.6 is released! Install via pip:
pip install --upgrade mosaicml-streaming==0.7.6
💎 New Features
1. device_per_stream batching method
Users can now construct batches such that each device sees only samples from a single stream. This is very useful in cases where different data sources have samples/tensors of different sizes, but the model should still see samples from these different data sources at each optimizer step.
- Adding device_per_stream batching by @snarayan21 in #661
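To illustrate the batching idea, the sketch below builds one batch per device, with every sample in a given batch coming from the same stream. It is a conceptual model only: device_batches is a hypothetical function, and choosing the stream uniformly at random is an assumption, since these notes do not say how Streaming picks the stream for each device.

```python
import random
from typing import Dict, List


def device_batches(streams: Dict[str, List[int]], num_devices: int,
                   batch_size: int, seed: int = 0) -> List[List[int]]:
    """Build one batch per device, each drawn from a single stream."""
    rng = random.Random(seed)
    names = sorted(streams)
    batches = []
    for _ in range(num_devices):
        name = rng.choice(names)  # Pick the stream for this device.
        batches.append(rng.sample(streams[name], batch_size))
    return batches


streams = {
    'short_text': list(range(0, 100)),     # Sample ids 0..99
    'long_text': list(range(100, 200)),    # Sample ids 100..199
}
for device, batch in enumerate(device_batches(streams, num_devices=4, batch_size=3)):
    print(device, batch)
```

Each device's batch is homogeneous, so its tensors share a shape, while the global batch across devices can still mix data sources within one optimizer step.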
2. Add ndarray type for Spark dataframes.
Enable parsing Spark's ArrayType (of ShortType, LongType, IntegerType, FloatType, DoubleType) when converting a Spark dataframe to MDS.
- Add ndarray type by @XiaohanZhangCMU in #623
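A lookup table is one way to picture the conversion for the five element types listed above. The encoding strings are an assumption based on MDS's dtype-suffixed ndarray encodings, and SPARK_ARRAY_TO_MDS and mds_encoding_for_array are hypothetical names, not the converter's actual identifiers.

```python
# Assumed mapping from Spark ArrayType element type names to MDS encodings.
SPARK_ARRAY_TO_MDS = {
    'ShortType': 'ndarray:int16',
    'IntegerType': 'ndarray:int32',
    'LongType': 'ndarray:int64',
    'FloatType': 'ndarray:float32',
    'DoubleType': 'ndarray:float64',
}


def mds_encoding_for_array(element_type: str) -> str:
    """Return the assumed MDS encoding for an ArrayType of `element_type`."""
    try:
        return SPARK_ARRAY_TO_MDS[element_type]
    except KeyError:
        raise ValueError(f'Unsupported ArrayType element: {element_type}') from None


print(mds_encoding_for_array('LongType'))
```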
3. Support for Alipan storage
Adds support for Alipan, Alibaba's cloud storage service.
- Add support for Alipan Storage backend by @PeterDing in #651
What's Changed
- Bump fastapi from 0.110.0 to 0.110.2 by @dependabot in #660
- Bump pydantic from 2.6.4 to 2.7.0 by @dependabot in #653
- Bump pydantic from 2.7.0 to 2.7.1 by @dependabot in #666
- Bump pytest from 8.1.1 to 8.2.0 by @dependabot in #664
- Bump databricks-sdk from 0.23.0 to 0.27.0 by @dependabot in #667
- Version bump to v0.7.6 by @snarayan21 in #669
New Contributors
- @PeterDing made their first contribution in #651
Full Changelog: v0.7.5...v0.7.6
v0.7.5
🚀 Streaming v0.7.5
Streaming v0.7.5 is released! Install via pip:
pip install --upgrade mosaicml-streaming==0.7.5
💎 New Features
1. Tensor/Sequence Parallelism Support
Using the replication argument, easily share data samples across multiple ranks, enabling sequence or tensor parallelism.
- Replicating samples across devices (SP / TP enablement) by @knighton in #597
- Expanded replication testing + documentation by @snarayan21 in #607
- Make streaming use the correct number of unique samples with SP/TP by @snarayan21 in #619
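A simplified model of replication, assuming consecutive ranks form a group that receives identical samples. samples_for_rank is a hypothetical helper, and the strided split is only for illustration; Streaming's real partitioning is more involved.

```python
from typing import List


def samples_for_rank(rank: int, world_size: int, replication: int,
                     sample_ids: List[int]) -> List[int]:
    """Return the samples a rank sees when every `replication` ranks share data."""
    if world_size % replication != 0:
        raise ValueError('world_size must be divisible by replication')
    num_groups = world_size // replication   # Number of distinct data shards.
    group = rank // replication              # Consecutive ranks form a group.
    return sample_ids[group::num_groups]


ids = list(range(8))
for rank in range(4):
    print(rank, samples_for_rank(rank, world_size=4, replication=2, sample_ids=ids))
```

With 4 ranks and replication=2, ranks 0 and 1 receive the same samples, as do ranks 2 and 3, so the number of unique data shards is world_size divided by replication.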
2. Overhauled Streaming Documentation
New and improved streaming documentation can be found here -- please submit issues with any feedback.
- Major overhaul of Streaming documentation by @snarayan21 in #636
3. batch_size is now required for StreamingDataset
Because we have seen multiple errors and performance degradations when users do not set the batch_size argument on StreamingDataset, setting it is now required in order to iterate over the dataset.
- You must set batch size. There is no other way. by @snarayan21 in #624
4. Support for Python 3.11, deprecate Python 3.8
- Add support for Python 3.11 and deprecate Python 3.8 by @karan6181 in #586
🐛 Bug Fixes
- [easy typo fix] fix f-string by @bigning in #596
- Change comparison in partitions to include equals by @JAEarly in #587
- Use type int when initializing SharedMemory size by @bchiang2 in #604
- COCO Dataset fix -- avoids allow_unsafe_types=True by @snarayan21 in #647
🔧 Improvements
- Allow writers to overwrite existing data by @JAEarly in #594
- Update careers link by @milocress in #611
- Update license by @b-chu in #568
- Updated documentation for S3-compatible object stores by @AIproj in #592
- Make yamllint consistent with Composer by @b-chu in #583
- Switch linting workflows to ci-testing repo by @b-chu in #616
What's Changed
- Bump uvicorn from 0.26.0 to 0.27.1 by @dependabot in #599
- Bump pytest-split from 0.8.1 to 0.8.2 by @dependabot in #581
- Update ruff to 0.2.2 by @Skylion007 in #608
- Bump fastapi from 0.109.0 to 0.110.0 by @dependabot in #610
- Bump yamllint from 1.33.0 to 1.35.1 by @dependabot in #601
- Bump uvicorn from 0.27.1 to 0.28.0 by @dependabot in #626
- Update moto requirement from <5,>=4.0 to >=4.0,<6 by @dependabot in #580
- Bump furo from 2023.7.26 to 2024.1.29 by @dependabot in #631
- Bump pypandoc from 1.12 to 1.13 by @dependabot in #630
- Bump databricks-sdk from 0.14.0 to 0.22.0 by @dependabot in #629
- Add batch_size to 1 if not provided for regression testing by @karan6181 in #635
- Fixed docstring note for getting sequential sample ordering by @snarayan21 in #632
- Bump pytest and fix failing test by @snarayan21 in #642
- Update pytest-cov requirement from <5,>=4 to >=4,<6 by @dependabot in #638
- Bump pydantic from 2.5.3 to 2.6.4 by @dependabot in #639
- Bump uvicorn from 0.28.0 to 0.29.0 by @dependabot in #640
- Bump databricks-sdk from 0.22.0 to 0.23.0 by @dependabot in #644
- Version bump to 0.7.5 by @snarayan21 in #650
New Contributors
- @bigning made their first contribution in #596
- @JAEarly made their first contribution in #587
- @AIproj made their first contribution in #592
- @milocress made their first contribution in #611
- @bchiang2 made their first contribution in #604
Full Changelog: v0.7.4...v0.7.5