
Conversation

kilinchange (Collaborator) commented on Nov 11, 2025

Development of the multi-node training infrastructure is complete. On a dual-node, 16-GPU setup, multi-dimensional parallel training of llama/gpt2 has been verified in both fp32 and bf16 precision, covering DDP, TP, SP, PP, and their combinations.
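For reference, a minimal sketch of how the parallel degrees compose on 2 nodes × 8 GPUs. The tp/pp values are taken from the example commands later in this PR, the ddp degree is derived from them, and the --sequence_parallel flag in those commands takes no degree of its own, so it does not consume additional ranks in this configuration:

```bash
# Sketch: how the parallel dimensions compose on 2 nodes x 8 GPUs (16 ranks).
# tp=2 and pp=2 mirror the example commands below; other splits of 16 are possible.
NNODES=2
GPUS_PER_NODE=8
WORLD_SIZE=$((NNODES * GPUS_PER_NODE))   # 16 ranks in total
TP=2                                     # tensor-parallel degree
PP=2                                     # pipeline-parallel degree
DDP=$((WORLD_SIZE / (TP * PP)))          # 4 data-parallel replicas
echo "world_size=${WORLD_SIZE} ddp=${DDP} tp=${TP} pp=${PP}"
```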

kilinchange force-pushed the feature/multi-node-training branch 5 times, most recently from b4b571c to 30640d6, on November 12, 2025 07:52
kilinchange force-pushed the feature/multi-node-training branch 2 times, most recently from bdc5db7 to 181a687, on November 19, 2025 12:24
kilinchange force-pushed the feature/multi-node-training branch 8 times, most recently from 6a30c51 to d1cc216, on December 4, 2025 09:42
kilinchange force-pushed the feature/multi-node-training branch from d1cc216 to 03f3534 on December 4, 2025 10:57
kilinchange (Collaborator, Author) commented on Dec 4, 2025

Llama, FP32 precision

Dual-node, 16 GPUs, with ddp+tp+sp+pp parallel training enabled:
[screenshot: training log]
Single node, same scale:
[screenshot: training log]

Llama, BF16 precision

Dual-node, 16 GPUs, with ddp+tp+sp+pp parallel training enabled:
[screenshot: training log]
Single node, same scale:
[screenshot: training log]

kilinchange (Collaborator, Author) commented on Dec 4, 2025

GPT2, FP32 precision

Dual-node, 16 GPUs, with ddp+tp+sp+pp parallel training enabled:
[screenshot: training log]
Single node, same scale:
[screenshot: training log]

GPT2, BF16 precision

Dual-node, 16 GPUs, with ddp+tp+sp+pp parallel training enabled:
[screenshot: training log]
Single node, same scale:
[screenshot: training log]

kilinchange (Collaborator, Author) commented on Dec 4, 2025

Taking the dual-node, 16-GPU llama fp32 case above as an example, the command run on the master node is:
./infini_run --nnodes=2 --nproc_per_node=1 --node_rank=0 -- ./gpt2 --device cuda --input_bin /data/shared/InfiniTrain-dev/data/llmc/gpt2/tinyshakespeare/tiny_shakespeare_train.bin --llmc_filepath /data/shared/InfiniTrain-dev/data/llmc/gpt2/gpt2_124M.bin --num_iteration 10 --nthread_per_process 8 --batch_size 40 --total_batch_size 10240 --tensor_parallel 2 --pipeline_parallel 2 --sequence_parallel
and on the worker node:
./infini_run --nnodes=2 --nproc_per_node=1 --node_rank=1 -- ./gpt2 --device cuda --input_bin /data/shared/InfiniTrain-dev/data/llmc/gpt2/tinyshakespeare/tiny_shakespeare_train.bin --llmc_filepath /data/shared/InfiniTrain-dev/data/llmc/gpt2/gpt2_124M.bin --num_iteration 10 --nthread_per_process 8 --batch_size 40 --total_batch_size 10240 --tensor_parallel 2 --pipeline_parallel 2 --sequence_parallel

The only difference between the two is the value of the --node_rank argument passed to infini_run.
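For convenience, both commands could be folded into a single wrapper script that only varies the node rank. A minimal sketch, where the script name and argument handling are purely illustrative and all flags and paths are copied verbatim from the commands above:

```bash
#!/usr/bin/env bash
# Hypothetical launch wrapper: run `./launch.sh 0` on the master node and
# `./launch.sh 1` on the worker node. Flags and paths are taken from the commands above.
NODE_RANK=${1:?usage: $0 <node_rank: 0 or 1>}

./infini_run --nnodes=2 --nproc_per_node=1 --node_rank="${NODE_RANK}" -- \
    ./gpt2 --device cuda \
    --input_bin /data/shared/InfiniTrain-dev/data/llmc/gpt2/tinyshakespeare/tiny_shakespeare_train.bin \
    --llmc_filepath /data/shared/InfiniTrain-dev/data/llmc/gpt2/gpt2_124M.bin \
    --num_iteration 10 --nthread_per_process 8 \
    --batch_size 40 --total_batch_size 10240 \
    --tensor_parallel 2 --pipeline_parallel 2 --sequence_parallel
```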

kilinchange changed the title from "[WIP] Feature/multi node training" to "Feature/multi node training" on Dec 4, 2025
kilinchange force-pushed the feature/multi-node-training branch from 03f3534 to c48ac8d on December 5, 2025 03:01
kilinchange self-assigned this on Dec 5, 2025