Closed
Description
I saw Megatron-LM has supported asynchronous checkpoint saving since v0.7.0.
@sbak5 I ran some tests on this feature and saw that it helps a lot. Digging into it, I found that the checkpoint format differs substantially from the one produced by synchronous saving.
Just 3 questions:
- What do __0_0.distcp and __0_1.distcp mean? There is no README or blog post about this feature. Could you please explain?
- How can I convert this format to the synchronous saving format, such as distrib_optim.pt and model_optim_rng.pt?
- How can I convert this format to the HuggingFace .bin format in order to do inference?
Thanks for your help ^_^
root@inp11049767626817836924-2-1:/home/Megatron-LM/fp8_async_save_te1.7_outputs/checkpoint/8B-lr1e-4-tp1-pp4# tree -lh
.
├── [ 325] iter_0000010
│ ├── [2.4G] __0_0.distcp
│ ├── [ 31G] __0_1.distcp
│ ├── [1.8G] __2_0.distcp
│ ├── [ 23G] __2_1.distcp
│ ├── [1.8G] __4_0.distcp
│ ├── [ 23G] __4_1.distcp
│ ├── [2.4G] __6_0.distcp
│ ├── [ 31G] __6_1.distcp
│ ├── [ 15K] common.pt
│ └── [ 119] metadata.json
├── [ 325] iter_0000020
│ ├── [2.4G] __0_0.distcp
│ ├── [ 31G] __0_1.distcp
│ ├── [1.8G] __2_0.distcp
│ ├── [ 23G] __2_1.distcp
│ ├── [1.8G] __4_0.distcp
│ ├── [ 23G] __4_1.distcp
│ ├── [2.4G] __6_0.distcp
│ ├── [ 31G] __6_1.distcp
│ ├── [ 15K] common.pt
│ └── [ 119] metadata.json
├── [ 325] iter_0000030
│ ├── [2.4G] __0_0.distcp
│ ├── [ 31G] __0_1.distcp
│ ├── [1.8G] __2_0.distcp
│ ├── [ 23G] __2_1.distcp
│ ├── [1.8G] __4_0.distcp
│ ├── [ 23G] __4_1.distcp
│ ├── [2.4G] __6_0.distcp
│ ├── [ 31G] __6_1.distcp
│ ├── [ 15K] common.pt
│ └── [ 119] metadata.json
└── [ 2] latest_checkpointed_iteration.txt
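
To make the naming pattern in the listing above easier to see, here is a minimal sketch that groups the .distcp files by their two numeric fields. It assumes the names follow the `__<writer>_<index>.distcp` pattern that PyTorch's torch.distributed.checkpoint file writer uses (first number: rank of the writing process, second number: a per-writer file counter). Whether Megatron-LM's async saving follows exactly this meaning is my assumption, not something confirmed here. The names `SHARD_RE` and `group_shards` are made up for illustration.

```python
import re
from collections import defaultdict

# Assumption: shard files are named `__<writer>_<index>.distcp`.
SHARD_RE = re.compile(r"^__(\d+)_(\d+)\.distcp$")


def group_shards(filenames):
    """Map each writer id to the sorted list of file indices it produced.

    Names that do not match the assumed pattern (e.g. common.pt,
    metadata.json) are ignored.
    """
    groups = defaultdict(list)
    for name in filenames:
        match = SHARD_RE.match(name)
        if match:
            writer, index = int(match.group(1)), int(match.group(2))
            groups[writer].append(index)
    return {writer: sorted(indices) for writer, indices in sorted(groups.items())}


if __name__ == "__main__":
    # File names copied from one iter_* directory in the listing above.
    listing = [
        "__0_0.distcp", "__0_1.distcp", "__2_0.distcp", "__2_1.distcp",
        "__4_0.distcp", "__4_1.distcp", "__6_0.distcp", "__6_1.distcp",
        "common.pt", "metadata.json",
    ]
    print(group_shards(listing))
    # -> {0: [0, 1], 2: [0, 1], 4: [0, 1], 6: [0, 1]}
```

Under that assumption, each checkpoint directory here holds two files from each of four writers (0, 2, 4, 6), which is the part I would like to have confirmed.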