PyTorch Distributed Overview
============================
**Author**: `Will Constable <https://github.com/wconstab/>`_, `Wei Feng <https://github.com/weifengpy>`_
**Translation**: `KJH622 <https://github.com/KJH622>`_

.. note::
   |edit| View and edit this tutorial in `github <https://github.com/pytorchkorea/tutorials-kr/blob/main/beginner_source/dist_overview.rst>`__.

This is the overview page for the ``torch.distributed`` package. The goal of
this page is to categorize the documents into different topics and briefly
describe each of them. If this is your first time building distributed training
applications with PyTorch, we recommend using this document to navigate to the
technology that best serves your use case.

Introduction
------------

The PyTorch Distributed library includes a collection of parallelism modules,
a communications layer, and infrastructure for launching and debugging large
training jobs.


Parallelism APIs
****************

These parallelism modules offer high-level functionality and compose with existing models:

- `Distributed Data-Parallel (DDP) <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`__
- `Fully Sharded Data-Parallel Training (FSDP2) <https://pytorch.org/docs/stable/distributed.fsdp.fully_shard.html>`__
- `Tensor Parallel (TP) <https://pytorch.org/docs/stable/distributed.tensor.parallel.html>`__
- `Pipeline Parallel (PP) <https://pytorch.org/docs/main/distributed.pipelining.html>`__
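
The sketch below is not part of the original page; it illustrates how one of these wrappers composes with an existing model by applying FSDP2's ``fully_shard`` to a plain ``nn.Sequential``. It assumes PyTorch 2.6 or later (where ``fully_shard`` is exported from ``torch.distributed.fsdp``), a ``torchrun`` launch, and NCCL-capable GPUs; the layer sizes are made up.

.. code-block:: python

    # Hedged sketch: shard an existing model with FSDP2 (assumes PyTorch >= 2.6, torchrun, NCCL).
    import os
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.distributed.fsdp import fully_shard

    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).cuda()

    # Apply fully_shard per layer, then to the root module; each call defines a sharding unit.
    for layer in model:
        if isinstance(layer, nn.Linear):
            fully_shard(layer)
    fully_shard(model)

    optim = torch.optim.AdamW(model.parameters(), lr=1e-3)
    loss = model(torch.randn(8, 1024, device="cuda")).sum()
    loss.backward()   # gradients are reduce-scattered across the data-parallel group
    optim.step()

    dist.destroy_process_group()
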

Sharding primitives
*******************

``DTensor`` and ``DeviceMesh`` are primitives used to build parallelism in terms of sharded or replicated tensors on N-dimensional process groups.

- `DTensor <https://github.com/pytorch/pytorch/blob/main/torch/distributed/tensor/README.md>`__ represents a tensor that is sharded and/or replicated, and communicates automatically to reshard tensors as needed by operations.
- `DeviceMesh <https://pytorch.org/docs/stable/distributed.html#devicemesh>`__ abstracts the accelerator device communicators into a multi-dimensional array, which manages the underlying ``ProcessGroup`` instances for collective communications in multi-dimensional parallelisms. Try out our `Device Mesh Recipe <https://tutorials.pytorch.kr/recipes/distributed_device_mesh.html>`__ to learn more.
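
To make the two primitives concrete, here is a minimal sketch (not from the original page) that builds a 1-D ``DeviceMesh`` and shards a regular tensor into a ``DTensor`` across it. It assumes PyTorch 2.5 or later, where these classes are exposed under ``torch.distributed.tensor`` and ``torch.distributed.device_mesh``, and a ``torchrun`` launch; the tensor shape is arbitrary.

.. code-block:: python

    # Hedged DeviceMesh / DTensor sketch (assumes PyTorch >= 2.5, launched via torchrun).
    import torch
    import torch.distributed as dist
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.tensor import Shard, distribute_tensor

    dist.init_process_group("gloo")              # CPU backend keeps the sketch hardware-agnostic
    world_size = dist.get_world_size()

    # A 1-D mesh over all ranks; N-D meshes (e.g. ("dp", "tp") dims) are built the same way.
    mesh = init_device_mesh("cpu", (world_size,))

    full = torch.randn(world_size * 4, 8)
    # Shard dim 0 across the mesh: each rank stores only its local slice of the tensor.
    dtensor = distribute_tensor(full, mesh, placements=[Shard(0)])
    print(dist.get_rank(), dtensor.to_local().shape)   # every rank prints torch.Size([4, 8])

    dist.destroy_process_group()
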

Communications APIs
*******************

The `PyTorch distributed communication layer (C10D) <https://pytorch.org/docs/stable/distributed.html>`__ offers both collective communication APIs (e.g., `all_reduce <https://pytorch.org/docs/stable/distributed.html#torch.distributed.all_reduce>`__
and `all_gather <https://pytorch.org/docs/stable/distributed.html#torch.distributed.all_gather>`__)
and P2P communication APIs (e.g.,
`send <https://pytorch.org/docs/stable/distributed.html#torch.distributed.send>`__
and `isend <https://pytorch.org/docs/stable/distributed.html#torch.distributed.isend>`__),
which are used under the hood in all of the parallelism implementations.
`Writing Distributed Applications with PyTorch <../intermediate/dist_tuto.html>`__
shows examples of using c10d communication APIs.
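
For reference, here is a small example of one collective and one point-to-point call. It is a sketch rather than part of the original page, and it assumes the script is launched with ``torchrun --nproc_per_node=2`` so the environment that C10D reads is already set.

.. code-block:: python

    # c10d sketch: a collective (all_reduce) plus a P2P pair (isend / recv).
    # Assumed launch: torchrun --nproc_per_node=2 c10d_example.py
    import torch
    import torch.distributed as dist

    dist.init_process_group("gloo")        # use "nccl" for GPU tensors
    rank = dist.get_rank()

    # Collective: every rank contributes and every rank receives the reduced value.
    t = torch.tensor([float(rank + 1)])
    dist.all_reduce(t, op=dist.ReduceOp.SUM)   # t becomes [3.] on both ranks

    # Point-to-point: rank 0 sends asynchronously, rank 1 blocks on recv.
    if rank == 0:
        req = dist.isend(torch.tensor([42.0]), dst=1)
        req.wait()
    else:
        buf = torch.zeros(1)
        dist.recv(buf, src=0)

    dist.destroy_process_group()
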
Launcher
********

`torchrun <https://pytorch.org/docs/stable/elastic/run.html>`__ is a widely used launcher script that spawns processes on the local and remote machines for running distributed PyTorch programs.
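
As a rough illustration of what ``torchrun`` provides to each spawned process, the snippet below reads the environment variables it sets and uses them to join the process group; the script name and process counts in the comment are placeholders.

.. code-block:: python

    # Assumed launch: torchrun --nnodes=1 --nproc_per_node=4 train.py
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT per process.
    import os
    import torch.distributed as dist

    def main():
        dist.init_process_group("gloo")               # reads the env vars set by torchrun
        rank = int(os.environ["RANK"])                # global rank across all nodes
        local_rank = int(os.environ["LOCAL_RANK"])    # rank within this machine (often the GPU index)
        world_size = int(os.environ["WORLD_SIZE"])
        print(f"rank {rank}/{world_size}, local rank {local_rank}")
        dist.destroy_process_group()

    if __name__ == "__main__":
        main()
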


Applying Parallelism To Scale Your Model
----------------------------------------

Data Parallelism is a widely adopted single-program multiple-data training paradigm:
the model is replicated on every process, each replica computes local gradients for
a different set of input data samples, and the gradients are averaged within the
data-parallel communicator group before each optimizer step.
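
The paragraph above maps directly onto a short DDP loop. The sketch below is illustrative only (made-up layer sizes and random data, assuming a ``torchrun`` launch with one GPU per process): each rank holds a full replica, computes gradients on its own batch, and DDP averages the gradients across the data-parallel group inside ``backward()`` before the optimizer step.

.. code-block:: python

    # Data-parallel sketch with DDP (assumes torchrun launch, one GPU per process).
    import os
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(nn.Linear(32, 4).cuda(), device_ids=[local_rank])   # full replica on every rank
    optim = torch.optim.SGD(model.parameters(), lr=0.1)

    for _ in range(3):
        x = torch.randn(16, 32, device="cuda")   # each rank would see a different shard of the data
        loss = model(x).sum()
        loss.backward()                          # gradients are averaged across ranks here
        optim.step()
        optim.zero_grad()

    dist.destroy_process_group()
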

Model Parallelism techniques (or Sharded Data Parallelism) are required when a model does not fit on a single GPU, and they can be combined to form multi-dimensional (N-D) parallelism techniques.

When deciding which parallelism techniques to choose for your model, use these common guidelines:

#. Use `DistributedDataParallel (DDP) <https://pytorch.org/docs/stable/notes/ddp.html>`__
   if your model fits in a single GPU but you want to easily scale up training using multiple GPUs.

   * Use `torchrun <https://pytorch.org/docs/stable/elastic/run.html>`__ to launch multiple PyTorch processes if you are using more than one node.

   * See also: `Getting Started with Distributed Data Parallel <../intermediate/ddp_tutorial.html>`__

#. Use `FullyShardedDataParallel (FSDP2) <https://pytorch.org/docs/stable/distributed.fsdp.fully_shard.html>`__ when your model cannot fit on one GPU.

   * See also: `Getting Started with FSDP2 <https://tutorials.pytorch.kr/intermediate/FSDP_tutorial.html>`__

#. Use `Tensor Parallel (TP) <https://pytorch.org/docs/stable/distributed.tensor.parallel.html>`__ and/or `Pipeline Parallel (PP) <https://pytorch.org/docs/main/distributed.pipelining.html>`__ if you reach scaling limitations with FSDP2 (a short Tensor Parallel sketch follows this list).

   * Try our `Tensor Parallelism Tutorial <https://tutorials.pytorch.kr/intermediate/TP_tutorial.html>`__

   * See also: `TorchTitan end-to-end example of 3D parallelism <https://github.com/pytorch/torchtitan>`__
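
As referenced in the last guideline above, here is a minimal, hedged sketch of Tensor Parallel applied to a toy feed-forward block. The module names, sizes, and two-GPU mesh are illustrative only, and it assumes a recent PyTorch (2.3 or later) with a ``torchrun`` launch.

.. code-block:: python

    # Tensor Parallel sketch (assumes PyTorch >= 2.3, 2 GPUs, torchrun --nproc_per_node=2).
    # Module names ("up_proj", "down_proj") and sizes are illustrative only.
    import os
    import torch
    import torch.nn as nn
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.tensor.parallel import (
        ColwiseParallel,
        RowwiseParallel,
        parallelize_module,
    )

    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    tp_mesh = init_device_mesh("cuda", (2,))   # 1-D mesh; sets up the default process group if needed

    class FeedForward(nn.Module):
        def __init__(self):
            super().__init__()
            self.up_proj = nn.Linear(256, 1024)
            self.down_proj = nn.Linear(1024, 256)

        def forward(self, x):
            return self.down_proj(torch.relu(self.up_proj(x)))

    model = FeedForward().cuda()
    # Column-shard the first projection and row-shard the second so that only one
    # collective is needed at the end of the block.
    model = parallelize_module(model, tp_mesh, {
        "up_proj": ColwiseParallel(),
        "down_proj": RowwiseParallel(),
    })

    out = model(torch.randn(8, 256, device="cuda"))
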
.. note:: Data-parallel training also works with `Automatic Mixed Precision (AMP) <https://pytorch.org/docs/stable/notes/amp_examples.html#working-with-multiple-gpus>`__.
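
As a rough, self-contained sketch of that combination (assuming PyTorch 2.4 or later for the ``torch.amp`` namespace and a ``torchrun`` launch), a DDP training step with autocast and a gradient scaler could look like this:

.. code-block:: python

    # Sketch: one DDP training step with automatic mixed precision (assumes PyTorch >= 2.4, torchrun).
    import os
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(nn.Linear(32, 4).cuda(), device_ids=[local_rank])
    optim = torch.optim.SGD(model.parameters(), lr=0.1)
    scaler = torch.amp.GradScaler("cuda")        # rescales gradients to avoid fp16 underflow

    x = torch.randn(16, 32, device="cuda")
    with torch.amp.autocast("cuda"):             # forward pass runs in mixed precision
        loss = model(x).sum()
    scaler.scale(loss).backward()                # DDP still averages the (scaled) gradients
    scaler.step(optim)
    scaler.update()

    dist.destroy_process_group()
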

PyTorch Distributed Developers
------------------------------

If you'd like to contribute to PyTorch Distributed, refer to our
`Developer Guide <https://github.com/pytorch/pytorch/blob/master/torch/distributed/CONTRIBUTING.md>`_.