
Conversation

kilinchange (Collaborator) commented on Nov 11, 2025

Development of the multi-node training infrastructure is complete. On a dual-node, 16-GPU setup, multi-dimensional parallel training of llama/gpt2 has been verified in both fp32 and bf16 precision, covering DDP, TP, SP, PP, and their combinations.
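For reference, a minimal sketch of how the parallel degrees compose on 2 nodes × 8 GPUs. The tp/pp values are taken from the example commands later in this PR, the ddp degree is derived from them, and the --sequence_parallel flag in those commands takes no degree of its own, so it does not consume additional ranks in this configuration:

```bash
# Sketch: how the parallel dimensions compose on 2 nodes x 8 GPUs (16 ranks).
# tp=2 and pp=2 mirror the example commands below; other splits of 16 are possible.
NNODES=2
GPUS_PER_NODE=8
WORLD_SIZE=$((NNODES * GPUS_PER_NODE))   # 16 ranks in total
TP=2                                     # tensor-parallel degree
PP=2                                     # pipeline-parallel degree
DDP=$((WORLD_SIZE / (TP * PP)))          # 4 data-parallel replicas
echo "world_size=${WORLD_SIZE} ddp=${DDP} tp=${TP} pp=${PP}"
```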

kilinchange force-pushed the feature/multi-node-training branch 5 times, most recently from b4b571c to 30640d6, on November 12, 2025 07:52
kilinchange force-pushed the feature/multi-node-training branch 2 times, most recently from bdc5db7 to 181a687, on November 19, 2025 12:24
kilinchange force-pushed the feature/multi-node-training branch 8 times, most recently from 6a30c51 to d1cc216, on December 4, 2025 09:42
kilinchange force-pushed the feature/multi-node-training branch from d1cc216 to 03f3534 on December 4, 2025 10:57
kilinchange (Collaborator, Author) commented on Dec 4, 2025

Llama, FP32 precision

Dual-node, 16 GPUs, with ddp+tp+sp+pp parallel training enabled:
[screenshot: training log]
Single node, same scale:
[screenshot: training log]

Llama, BF16 precision

Dual-node, 16 GPUs, with ddp+tp+sp+pp parallel training enabled:
[screenshot: training log]
Single node, same scale:
[screenshot: training log]

kilinchange (Collaborator, Author) commented on Dec 4, 2025

GPT2, FP32 precision

Dual-node, 16 GPUs, with ddp+tp+sp+pp parallel training enabled:
[screenshot: training log]
Single node, same scale:
[screenshot: training log]

GPT2, BF16 precision

Dual-node, 16 GPUs, with ddp+tp+sp+pp parallel training enabled:
[screenshot: training log]
Single node, same scale:
[screenshot: training log]

kilinchange (Collaborator, Author) commented on Dec 4, 2025

Taking the dual-node, 16-GPU llama fp32 case above as an example, the command run on the master node is:
./infini_run --nnodes=2 --nproc_per_node=1 --node_rank=0 -- ./gpt2 --device cuda --input_bin /data/shared/InfiniTrain-dev/data/llmc/gpt2/tinyshakespeare/tiny_shakespeare_train.bin --llmc_filepath /data/shared/InfiniTrain-dev/data/llmc/gpt2/gpt2_124M.bin --num_iteration 10 --nthread_per_process 8 --batch_size 40 --total_batch_size 10240 --tensor_parallel 2 --pipeline_parallel 2 --sequence_parallel
and on the worker node:
./infini_run --nnodes=2 --nproc_per_node=1 --node_rank=1 -- ./gpt2 --device cuda --input_bin /data/shared/InfiniTrain-dev/data/llmc/gpt2/tinyshakespeare/tiny_shakespeare_train.bin --llmc_filepath /data/shared/InfiniTrain-dev/data/llmc/gpt2/gpt2_124M.bin --num_iteration 10 --nthread_per_process 8 --batch_size 40 --total_batch_size 10240 --tensor_parallel 2 --pipeline_parallel 2 --sequence_parallel

The only difference between the two is the value of the --node_rank argument passed to infini_run.
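For convenience, both commands could be folded into a single wrapper script that only varies the node rank. A minimal sketch, where the script name and argument handling are purely illustrative and all flags and paths are copied verbatim from the commands above:

```bash
#!/usr/bin/env bash
# Hypothetical launch wrapper: run `./launch.sh 0` on the master node and
# `./launch.sh 1` on the worker node. Flags and paths are taken from the commands above.
NODE_RANK=${1:?usage: $0 <node_rank: 0 or 1>}

./infini_run --nnodes=2 --nproc_per_node=1 --node_rank="${NODE_RANK}" -- \
    ./gpt2 --device cuda \
    --input_bin /data/shared/InfiniTrain-dev/data/llmc/gpt2/tinyshakespeare/tiny_shakespeare_train.bin \
    --llmc_filepath /data/shared/InfiniTrain-dev/data/llmc/gpt2/gpt2_124M.bin \
    --num_iteration 10 --nthread_per_process 8 \
    --batch_size 40 --total_batch_size 10240 \
    --tensor_parallel 2 --pipeline_parallel 2 --sequence_parallel
```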

kilinchange changed the title from "[WIP] Feature/multi node training" to "Feature/multi node training" on Dec 4, 2025
kilinchange force-pushed the feature/multi-node-training branch from 03f3534 to c48ac8d on December 5, 2025 03:01
kilinchange self-assigned this on Dec 5, 2025