╔══════════════════════════════════════════════════════════════╗
║                   distributed-training-lab                    ║
║            DDP • FSDP • Benchmarking • Single-Node            ║
╚══════════════════════════════════════════════════════════════╝
Minimal PyTorch DDP/FSDP lab for benchmarking distributed training on a single node.
This repo exists because I care deeply about PyTorch internals and performance. It's a clean reference implementation for both interview discussions and real production workloads. Everything is designed to run on a single machine with 1-4 GPUs, making it accessible while demonstrating production-quality distributed training patterns.
- Single-node multi-GPU DDP example - Full model replication with gradient synchronization
- Single-node FSDP example - Parameter sharding with configurable strategies
- Simple CIFAR-10 dataset - Fast iteration, no huge downloads
- Benchmark script - Compares DDP vs FSDP on the following (a measurement sketch appears after this list):
- Samples/sec throughput
- Average step time
- Peak GPU memory usage
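These metrics need nothing beyond stock PyTorch. A hedged sketch of how per-step timing and peak memory could be collected (the `timed_steps` helper name and signature are hypothetical, not the repo's actual API):

```python
# Illustrative sketch of how the benchmark metrics could be measured with stock PyTorch.
# The timed_steps helper is hypothetical, not the repo's API.
import time

import torch


def timed_steps(model, loader, optimizer, loss_fn, device):
    """Return (samples_per_sec, avg_step_time_s, peak_mem_bytes) over one pass of loader."""
    torch.cuda.reset_peak_memory_stats(device)
    step_times, n_samples = [], 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        torch.cuda.synchronize(device)   # start timing only after queued GPU work finishes
        t0 = time.perf_counter()
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
        torch.cuda.synchronize(device)   # wait for the step's kernels to complete
        step_times.append(time.perf_counter() - t0)
        n_samples += x.size(0)
    avg_step = sum(step_times) / len(step_times)
    samples_per_sec = n_samples / sum(step_times)
    peak_mem = torch.cuda.max_memory_allocated(device)
    return samples_per_sec, avg_step, peak_mem
```

Note that this gives per-process throughput; a node-wide samples/sec figure would still need to be aggregated across ranks.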
```bash
git clone https://github.com/yourusername/distributed-training-lab.git
cd distributed-training-lab
pip install -r requirements.txt
```

```bash
# DDP training on 4 GPUs
torchrun --nproc_per_node=4 src/ddp_train.py --config configs/ddp_resnet_cifar10.yaml

# FSDP training on 4 GPUs
torchrun --nproc_per_node=4 src/fsdp_train.py --config configs/fsdp_resnet_cifar10.yaml

# Benchmark: runs both and prints a comparison table
python src/benchmark_runner.py
```

Or use the convenience scripts:

```bash
bash scripts/run_ddp_single_node.sh
bash scripts/run_fsdp_single_node.sh
```

YAML configs control model architecture, batch size, learning rate, optimizer, and FSDP sharding strategy. Key settings:
- Model: Small ResNet (configurable depth/width)
- Batch size: Per-GPU batch size (effective batch = batch_size × num_gpus)
- Precision: FP32 by default, FP16 available for FSDP
- Sharding: FULL_SHARD (default) or SHARD_GRAD_OP for FSDP (see the mapping sketch below)
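For FSDP, the sharding value maps onto `torch.distributed.fsdp.ShardingStrategy`. A hedged sketch of that mapping (the `wrap_fsdp` helper is illustrative, not necessarily how the repo wires it):

```python
# Hypothetical mapping from a YAML "sharding" value to an FSDP-wrapped model.
# wrap_fsdp is illustrative; the repo's actual wiring lives in src/fsdp_train.py.
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

SHARDING_STRATEGIES = {
    # Shard params, grads, and optimizer state (ZeRO-3-like).
    "FULL_SHARD": ShardingStrategy.FULL_SHARD,
    # Shard grads/optimizer state; params stay gathered between forward and backward (ZeRO-2-like).
    "SHARD_GRAD_OP": ShardingStrategy.SHARD_GRAD_OP,
}


def wrap_fsdp(model: torch.nn.Module, sharding: str = "FULL_SHARD") -> FSDP:
    """Wrap a CUDA-resident model; assumes the process group is already initialized."""
    return FSDP(model, sharding_strategy=SHARDING_STRATEGIES[sharding])
```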
Example config structure:
```yaml
model:
  name: "small_resnet"
  num_layers: 18
training:
  batch_size: 128
  num_epochs: 10
  learning_rate: 0.1
  optimizer: "sgd"
```

Key concepts this lab demonstrates:

- DDP replication vs FSDP sharding: DDP keeps a full copy of the model on each GPU; FSDP splits parameters across GPUs, trading memory for communication overhead
- Batch size scaling: Effective batch size = per_gpu_batch × num_gpus; larger batches improve GPU utilization but require more memory
- Memory vs communication tradeoff: DDP uses more memory but less communication; FSDP saves memory but adds all-gather overhead
- DistributedSampler usage: Ensures each GPU sees a different partition of the data, which is critical for correct distributed training (see the sketch after this list)
- Gradient synchronization: DDP averages gradients with all-reduce during the backward pass; FSDP all-gathers parameters and reduce-scatters gradients, adding more collectives per step, but each one operates on shards
- Single-node patterns: How to structure code that works on 1 GPU (testing) and scales to 4 GPUs (production)
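To make the DistributedSampler and gradient-synchronization points concrete, here is a minimal single-node DDP sketch under the same torchrun launch pattern. The stand-in linear model and random tensors are placeholders; the repo trains a small ResNet on CIFAR-10 via `src/ddp_train.py`.

```python
# Minimal single-node DDP sketch (illustrative; not the repo's exact entrypoint).
# Launch: torchrun --nproc_per_node=4 ddp_sketch.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def main():
    dist.init_process_group(backend="nccl")      # torchrun provides rank/world-size env vars
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Stand-in model/data; the repo uses a small ResNet on CIFAR-10.
    model = DDP(torch.nn.Linear(3 * 32 * 32, 10).cuda(), device_ids=[local_rank])
    dataset = TensorDataset(torch.randn(1024, 3 * 32 * 32), torch.randint(0, 10, (1024,)))
    sampler = DistributedSampler(dataset)        # each rank reads a disjoint partition
    loader = DataLoader(dataset, batch_size=128, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.CrossEntropyLoss()
    for epoch in range(2):
        sampler.set_epoch(epoch)                 # keep shuffling consistent across ranks
        for x, y in loader:
            x, y = x.cuda(), y.cuda()
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()      # backward triggers the gradient all-reduce
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

The same script runs on a single GPU with `--nproc_per_node=1`, which is how the 1-GPU testing / 4-GPU scaling pattern stays in one codebase.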
Limitations and possible extensions:

- No multi-node launcher yet (single-node focus keeps it simple)
- Could add mixed precision variants, gradient checkpointing, or larger models (a mixed-precision sketch follows this list)
- Pipeline/tensor parallelism not included (keeps scope focused)
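For reference, FSDP mixed precision (one of the possible extensions noted above) could be enabled along these lines. This is a sketch of the stock PyTorch API, not something currently wired into the repo:

```python
# Sketch of enabling FP16 for FSDP via the stock MixedPrecision policy.
# Not currently part of this repo; shown only as one possible extension.
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

fp16_policy = MixedPrecision(
    param_dtype=torch.float16,   # compute with FP16 parameters
    reduce_dtype=torch.float16,  # reduce gradients in FP16
    buffer_dtype=torch.float16,  # keep buffers (e.g. BatchNorm stats) in FP16
)


def wrap_fsdp_fp16(model: torch.nn.Module) -> FSDP:
    """Wrap a CUDA-resident model with FP16 compute and communication."""
    return FSDP(model, mixed_precision=fp16_policy)
```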
Repository layout:

```
distributed-training-lab/
├── README.md
├── requirements.txt
├── configs/
│   ├── ddp_resnet_cifar10.yaml
│   └── fsdp_resnet_cifar10.yaml
├── src/
│   ├── data.py               # CIFAR-10 dataset + dataloaders
│   ├── models.py             # Small ResNet implementation
│   ├── utils.py              # Logging, timing, seeding
│   ├── ddp_train.py          # DDP training entrypoint
│   ├── fsdp_train.py         # FSDP training entrypoint
│   └── benchmark_runner.py   # Runs both, prints comparison table
├── scripts/
│   ├── run_ddp_single_node.sh
│   └── run_fsdp_single_node.sh
└── examples/
    └── notes_ddp_vs_fsdp.md  # Conceptual notes
```
MIT License - see LICENSE file for details.