From ae964ee177d8d953689dc2251927e42e1b567171 Mon Sep 17 00:00:00 2001
From: iPLAY888 <133153661+letmehateu@users.noreply.github.com>
Date: Wed, 15 Oct 2025 13:34:44 +0300
Subject: [PATCH 1/5] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index a09f8ccb1b..abd1a7741c 100644
--- a/README.md
+++ b/README.md
@@ -55,7 +55,7 @@ To accelerate contributions to and innovations around torchtitan, we are hosting
 
 1. Multi-dimensional composable parallelisms
    - [FSDP2](docs/fsdp.md) with per-parameter sharding
-   - [Tensor Parallel](https://pytorch.org/docs/stable/distributed.tensor.parallel.html) (including [async TP](https://discuss.pytorch.org/t/distributed-w-torchtitan-introducing-async-tensor-parallelism-in-pytorch/209487))
+   - [Tensor Parallel](https://docs.pytorch.org/docs/stable/distributed.tensor.parallel.html) (including [async TP](https://discuss.pytorch.org/t/distributed-w-torchtitan-introducing-async-tensor-parallelism-in-pytorch/209487))
    - [Pipeline Parallel](https://discuss.pytorch.org/t/distributed-w-torchtitan-training-with-zero-bubble-pipeline-parallelism/214420)
    - [Context Parallel](https://discuss.pytorch.org/t/distributed-w-torchtitan-breaking-barriers-training-long-context-llms-with-1m-sequence-length-in-pytorch-using-context-parallel/215082)
 2. [Meta device](https://pytorch.org/docs/stable/meta.html) initialization

From 7e0a08caef3da9b2fa4d10b23f7e73c9f5ba3164 Mon Sep 17 00:00:00 2001
From: iPLAY888 <133153661+letmehateu@users.noreply.github.com>
Date: Wed, 15 Oct 2025 13:35:43 +0300
Subject: [PATCH 2/5] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index abd1a7741c..1cceac06ed 100644
--- a/README.md
+++ b/README.md
@@ -58,7 +58,7 @@ To accelerate contributions to and innovations around torchtitan, we are hosting
    - [Tensor Parallel](https://docs.pytorch.org/docs/stable/distributed.tensor.parallel.html) (including [async TP](https://discuss.pytorch.org/t/distributed-w-torchtitan-introducing-async-tensor-parallelism-in-pytorch/209487))
    - [Pipeline Parallel](https://discuss.pytorch.org/t/distributed-w-torchtitan-training-with-zero-bubble-pipeline-parallelism/214420)
    - [Context Parallel](https://discuss.pytorch.org/t/distributed-w-torchtitan-breaking-barriers-training-long-context-llms-with-1m-sequence-length-in-pytorch-using-context-parallel/215082)
-2. [Meta device](https://pytorch.org/docs/stable/meta.html) initialization
+2. [Meta device](https://docs.pytorch.org/docs/stable/meta.html) initialization
 3. Selective (layer or operator) and full activation checkpointing
 4. [Distributed checkpointing](https://discuss.pytorch.org/t/distributed-w-torchtitan-optimizing-checkpointing-efficiency-with-pytorch-dcp/211250) (including async checkpointing)
    - [Interoperable checkpoints](docs/checkpoint.md) which can be loaded directly into [`torchtune`](https://github.com/pytorch/torchtune) for fine-tuning

From 1b94c110566ac6629469413ddae5591a8ac8685c Mon Sep 17 00:00:00 2001
From: iPLAY888 <133153661+letmehateu@users.noreply.github.com>
Date: Wed, 15 Oct 2025 13:36:06 +0300
Subject: [PATCH 3/5] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 1cceac06ed..3b92e77482 100644
--- a/README.md
+++ b/README.md
@@ -61,7 +61,7 @@ To accelerate contributions to and innovations around torchtitan, we are hosting
 2. [Meta device](https://docs.pytorch.org/docs/stable/meta.html) initialization
 3. Selective (layer or operator) and full activation checkpointing
 4. [Distributed checkpointing](https://discuss.pytorch.org/t/distributed-w-torchtitan-optimizing-checkpointing-efficiency-with-pytorch-dcp/211250) (including async checkpointing)
-   - [Interoperable checkpoints](docs/checkpoint.md) which can be loaded directly into [`torchtune`](https://github.com/pytorch/torchtune) for fine-tuning
+   - [Interoperable checkpoints](docs/checkpoint.md) which can be loaded directly into [`torchtune`](https://github.com/meta-pytorch/torchtune) for fine-tuning
 5. `torch.compile` support
 6. [Float8](https://discuss.pytorch.org/t/distributed-w-torchtitan-enabling-float8-all-gather-in-fsdp2/209323) support ([how-to](docs/float8.md))
 7. DDP and HSDP

From 75b770be035ef7305fa15b5ab175548be77d3a02 Mon Sep 17 00:00:00 2001
From: iPLAY888 <133153661+letmehateu@users.noreply.github.com>
Date: Wed, 15 Oct 2025 13:36:53 +0300
Subject: [PATCH 4/5] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 3b92e77482..3e5e8b9e6a 100644
--- a/README.md
+++ b/README.md
@@ -65,7 +65,7 @@ To accelerate contributions to and innovations around torchtitan, we are hosting
 5. `torch.compile` support
 6. [Float8](https://discuss.pytorch.org/t/distributed-w-torchtitan-enabling-float8-all-gather-in-fsdp2/209323) support ([how-to](docs/float8.md))
 7. DDP and HSDP
-8. [TorchFT](https://github.com/pytorch/torchft) integration
+8. [TorchFT](https://github.com/meta-pytorch/torchft) integration
 9. Checkpointable data-loading, with the C4 dataset pre-configured (144M entries) and support for [custom datasets](docs/datasets.md)
 10. Gradient accumulation, enabled by giving an additional `--training.global_batch_size` argument in configuration
 11. Flexible learning rate scheduler (warmup-stable-decay)

From f6ee6590a17b81534d69a952fce2dd533f985e97 Mon Sep 17 00:00:00 2001
From: iPLAY888 <133153661+letmehateu@users.noreply.github.com>
Date: Wed, 15 Oct 2025 13:39:33 +0300
Subject: [PATCH 5/5] Update fsdp.md

---
 docs/fsdp.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/fsdp.md b/docs/fsdp.md
index 3f2c7f5e6e..d1f0348072 100644
--- a/docs/fsdp.md
+++ b/docs/fsdp.md
@@ -1,7 +1,7 @@
 # FSDP1 -> FSDP2
 
 ## Why FSDP2?
-PyTorch's fully sharded data parallelism (FSDP) API, [`FullyShardedDataParallel`](https://pytorch.org/docs/stable/fsdp.html), looks to offer a performant eager-mode implementation, including communication bucketing and communication/computation overlap. It defines a `FlatParameter` by flattening and concatenating a group of parameters to represent a communication bucket. However, this `FlatParameter` complicates applying different behaviors to individual parameters within the `FlatParameter`, e.g. parameter freezing, parameter casting, etc., hurting composability, and it complicates the internal implementation, e.g. making state dict logic thousands of lines and requiring additional communications.
+PyTorch's fully sharded data parallelism (FSDP) API, [`FullyShardedDataParallel`](https://docs.pytorch.org/docs/stable/fsdp.html), looks to offer a performant eager-mode implementation, including communication bucketing and communication/computation overlap. It defines a `FlatParameter` by flattening and concatenating a group of parameters to represent a communication bucket. However, this `FlatParameter` complicates applying different behaviors to individual parameters within the `FlatParameter`, e.g. parameter freezing, parameter casting, etc., hurting composability, and it complicates the internal implementation, e.g. making state dict logic thousands of lines and requiring additional communications.
 
 With these limitations in mind, we designed and implemented an FSDP rewrite removing the `FlatParameter`. We refer to this rewrite as FSDP2 and the original as FSDP1. FSDP2 targets the same use cases as FSDP1 plus more, and FSDP2 still strives for good performance in eager mode, using several of the same techniques.
 
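Note on the fsdp.md hunk in PATCH 5/5: the paragraph describes FSDP1's `FlatParameter` as a bucket built by flattening and concatenating a group of parameters. The following is a minimal standalone sketch of that idea using plain tensor ops (not the actual FSDP internals), which also shows why per-parameter behaviors such as freezing or dtype casting become awkward:

```python
import torch

# Sketch of the FlatParameter idea: flatten and concatenate a group of
# parameters into a single 1-D tensor that serves as one communication bucket.
params = [torch.randn(4, 3), torch.randn(5), torch.randn(2, 2)]
flat_param = torch.cat([p.reshape(-1) for p in params])

# The bucket holds every element of every parameter in the group.
assert flat_param.numel() == sum(p.numel() for p in params)  # 12 + 5 + 4 = 21

# Individual parameters now only exist as views into the flat tensor, so
# treating one parameter differently (freeze it, cast it) means special-casing
# a slice of the bucket rather than acting on an independent tensor.
views = torch.split(flat_param, [p.numel() for p in params])
restored = [v.reshape(p.shape) for v, p in zip(views, params)]
assert all(torch.equal(r, p) for r, p in zip(restored, params))
```

This is one way to read why FSDP2's per-parameter sharding (each parameter kept as its own `DTensor`, per the README feature list above) sidesteps the composability issues the paragraph lists.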