Stochastic Gradient Descent (SGD) and SGD-like methods (e.g., Adam) are commonly used in PyTorch to train ML models. However, these methods rely on a random data order to converge, which usually requires a full data shuffle and leads to low I/O performance on disk-based storage.
We proposed a simple but novel two-level data shuffling strategy named CorgiPile (https://link.springer.com/article/10.1007/s00778-024-00845-0), which avoids a full data shuffle while maintaining a convergence rate comparable to that of fully shuffled SGD. CorgiPile first samples and shuffles data at the block level, and then shuffles data at the tuple level within the sampled blocks: it shuffles the order of the data blocks, merges a few sampled blocks into a small buffer, and finally shuffles the tuples in the buffer before feeding them to SGD.

We have implemented CorgiPile inside PyTorch (https://github.com/DS3Lab/CorgiPile-PyTorch). Extensive experiments show that CorgiPile achieves a convergence rate comparable to full-shuffle SGD while running faster than PyTorch with a full data shuffle.
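To make the two-level idea concrete, below is a minimal sketch of block-level plus tuple-level shuffling in plain Python. It is independent of the actual CorgiPile-PyTorch implementation; the function name `corgipile_shuffle` and the `buffer_blocks` parameter are illustrative assumptions, not the repository's API.

```python
import random
from typing import Iterator, List, Sequence

def corgipile_shuffle(blocks: Sequence[Sequence], buffer_blocks: int = 2) -> Iterator:
    """Sketch of a two-level shuffle: shuffle the block order, then fill a
    small buffer with a few blocks at a time and shuffle the tuples inside it.

    `blocks` is the dataset already partitioned into contiguous on-disk
    blocks; `buffer_blocks` is the buffer size measured in blocks.
    (Both names are illustrative, not CorgiPile's actual interface.)
    """
    block_order = list(range(len(blocks)))
    random.shuffle(block_order)               # level 1: block-level shuffle

    buffer: List = []
    for count, block_id in enumerate(block_order, start=1):
        buffer.extend(blocks[block_id])       # read a whole block sequentially
        if count % buffer_blocks == 0 or count == len(block_order):
            random.shuffle(buffer)            # level 2: tuple-level shuffle
            yield from buffer                 # stream tuples to the SGD loop
            buffer.clear()

# Example: 8 blocks of 4 tuples each, consumed as a shuffled tuple stream.
data_blocks = [[b * 4 + i for i in range(4)] for b in range(8)]
for tuple_ in corgipile_shuffle(data_blocks, buffer_blocks=2):
    pass  # feed `tuple_` to the SGD / Adam update here
```

Because each block is read sequentially and only a small buffer is held in memory, the I/O pattern stays mostly sequential while the emitted tuple order is still well randomized within each buffer.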