diff --git a/README.md b/README.md
index a09f8ccb1b..3e5e8b9e6a 100644
--- a/README.md
+++ b/README.md
@@ -55,17 +55,17 @@ To accelerate contributions to and innovations around torchtitan, we are hosting
 1. Multi-dimensional composable parallelisms
    - [FSDP2](docs/fsdp.md) with per-parameter sharding
-   - [Tensor Parallel](https://pytorch.org/docs/stable/distributed.tensor.parallel.html) (including [async TP](https://discuss.pytorch.org/t/distributed-w-torchtitan-introducing-async-tensor-parallelism-in-pytorch/209487))
+   - [Tensor Parallel](https://docs.pytorch.org/docs/stable/distributed.tensor.parallel.html) (including [async TP](https://discuss.pytorch.org/t/distributed-w-torchtitan-introducing-async-tensor-parallelism-in-pytorch/209487))
    - [Pipeline Parallel](https://discuss.pytorch.org/t/distributed-w-torchtitan-training-with-zero-bubble-pipeline-parallelism/214420)
    - [Context Parallel](https://discuss.pytorch.org/t/distributed-w-torchtitan-breaking-barriers-training-long-context-llms-with-1m-sequence-length-in-pytorch-using-context-parallel/215082)
-2. [Meta device](https://pytorch.org/docs/stable/meta.html) initialization
+2. [Meta device](https://docs.pytorch.org/docs/stable/meta.html) initialization
 3. Selective (layer or operator) and full activation checkpointing
 4. [Distributed checkpointing](https://discuss.pytorch.org/t/distributed-w-torchtitan-optimizing-checkpointing-efficiency-with-pytorch-dcp/211250) (including async checkpointing)
-   - [Interoperable checkpoints](docs/checkpoint.md) which can be loaded directly into [`torchtune`](https://github.com/pytorch/torchtune) for fine-tuning
+   - [Interoperable checkpoints](docs/checkpoint.md) which can be loaded directly into [`torchtune`](https://github.com/meta-pytorch/torchtune) for fine-tuning
 5. `torch.compile` support
 6. [Float8](https://discuss.pytorch.org/t/distributed-w-torchtitan-enabling-float8-all-gather-in-fsdp2/209323) support ([how-to](docs/float8.md))
 7. DDP and HSDP
-8. [TorchFT](https://github.com/pytorch/torchft) integration
+8. [TorchFT](https://github.com/meta-pytorch/torchft) integration
 9. Checkpointable data-loading, with the C4 dataset pre-configured (144M entries) and support for [custom datasets](docs/datasets.md)
 10. Gradient accumulation, enabled by giving an additional `--training.global_batch_size` argument in configuration
 11. Flexible learning rate scheduler (warmup-stable-decay)
diff --git a/docs/fsdp.md b/docs/fsdp.md
index 3f2c7f5e6e..d1f0348072 100644
--- a/docs/fsdp.md
+++ b/docs/fsdp.md
@@ -1,7 +1,7 @@
 # FSDP1 -> FSDP2
 
 ## Why FSDP2?
 
-PyTorch's fully sharded data parallelism (FSDP) API, [`FullyShardedDataParallel`](https://pytorch.org/docs/stable/fsdp.html), looks to offer a performant eager-mode implementation, including communication bucketing and communication/computation overlap. It defines a `FlatParameter` by flattening and concatenating a group of parameters to represent a communication bucket. However, this `FlatParameter` complicates applying different behaviors to individual parameters within the `FlatParameter`, e.g. parameter freezing, parameter casting, etc., hurting composability, and it complicates the internal implementation, e.g. making state dict logic thousands of lines and requiring additional communications.
+PyTorch's fully sharded data parallelism (FSDP) API, [`FullyShardedDataParallel`](https://docs.pytorch.org/docs/stable/fsdp.html), looks to offer a performant eager-mode implementation, including communication bucketing and communication/computation overlap. It defines a `FlatParameter` by flattening and concatenating a group of parameters to represent a communication bucket. However, this `FlatParameter` complicates applying different behaviors to individual parameters within the `FlatParameter`, e.g. parameter freezing, parameter casting, etc., hurting composability, and it complicates the internal implementation, e.g. making state dict logic thousands of lines and requiring additional communications.
 
 With these limitations in mind, we designed and implemented an FSDP rewrite removing the `FlatParameter`. We refer to this rewrite as FSDP2 and the original as FSDP1. FSDP2 targets the same use cases as FSDP1 plus more, and FSDP2 still strives for good performance in eager mode, using several of the same techniques.
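The `FlatParameter` limitation described in the fsdp.md hunk can be made concrete with a minimal sketch. This is plain PyTorch, not FSDP internals; the two-layer model and the `offsets` bookkeeping are purely illustrative assumptions:

```python
import torch
import torch.nn as nn

# Illustrative only: mimic FSDP1's FlatParameter idea by flattening and
# concatenating a group of parameters into one contiguous 1-D buffer
# (the communication bucket).
model = nn.Sequential(nn.Linear(4, 3), nn.Linear(3, 2))
flat = torch.cat([p.detach().reshape(-1) for p in model.parameters()])

# One buffer now holds every element of every parameter in the group, so
# per-parameter behaviors (freezing one layer, casting one weight to another
# dtype) require offset bookkeeping into `flat` rather than operating on
# ordinary nn.Parameter objects -- the composability cost that motivated
# FSDP2's per-parameter sharding.
offsets = torch.cumsum(
    torch.tensor([0] + [p.numel() for p in model.parameters()]), dim=0
)
assert flat.numel() == sum(p.numel() for p in model.parameters())
```

By contrast, FSDP2's per-parameter sharding keeps each parameter an individually addressable (sharded) tensor, so no such offset arithmetic is needed.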