  • What is the meaning of __0_0.distcp and __0_1.distcp? There is no readme or blog about this feature. Could you please explain it?

The file name has the form __<global rank>_<writer process ID>.distcp. The global rank is straightforward: it is the rank in the default process group created by PyTorch. We create multiple writer processes so that checkpoints can be written asynchronously. The writer process ID simply indicates which writer process produced the corresponding checkpoint file.
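To make the naming concrete, here is a minimal Python sketch that splits such a file name into its two parts. The pattern `__<global rank>_<writer process ID>.distcp` is inferred from the answer above and from the two example names in the question. The helper names `parse_distcp_name` and `group_by_rank` are hypothetical and are not part of Megatron-LM or PyTorch.

```python
import re
from typing import Dict, Iterable, List, NamedTuple, Optional

# Assumed pattern, inferred from the names __0_0.distcp and __0_1.distcp:
# two underscores, global rank, one underscore, writer process ID.
_DISTCP_NAME = re.compile(r"__(\d+)_(\d+)\.distcp")


class ShardFile(NamedTuple):
    global_rank: int  # rank in the default PyTorch process group
    writer_id: int    # index of the writer process that produced the file


def parse_distcp_name(filename: str) -> Optional[ShardFile]:
    """Return the (global_rank, writer_id) pair, or None if the name does not match."""
    match = _DISTCP_NAME.fullmatch(filename)
    if match is None:
        return None
    return ShardFile(int(match.group(1)), int(match.group(2)))


def group_by_rank(filenames: Iterable[str]) -> Dict[int, List[int]]:
    """Map each global rank to the sorted writer IDs found for it."""
    grouped: Dict[int, List[int]] = {}
    for name in filenames:
        shard = parse_distcp_name(name)
        if shard is None:
            continue  # skip files that are not .distcp shards
        grouped.setdefault(shard.global_rank, []).append(shard.writer_id)
    return {rank: sorted(ids) for rank, ids in grouped.items()}
```

Under this reading, `__0_0.distcp` and `__0_1.distcp` are two files written for global rank 0 by writer processes 0 and 1.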

  • How can this format be converted to the synchronous saving format, such as distrib_optim.pt and model_optim_rng.pt?

The previous checkpoint format in Megatron-LM changed with the introduction of dist_checkpointing by @mikolajblaz. So, synchronous checkpointing (--use-di…

Replies: 13 comments

Answer selected by dimapihtar
Category
Q&A
Labels
None yet
7 participants
This discussion was converted from issue #964 on September 04, 2024 18:34.