Skip to content

Conversation

@aliasaria
Copy link
Member

@aliasaria aliasaria commented Aug 1, 2025

This implementation works by adding a parameter as an "input_override" to a task being queued. It sends the old training data but adds "restart_from_checkpoint" and gives a checkpoint name. This parameter will be sent to the trainer script which needs to handle it if it is a script that supports checkpointing

@aliasaria
Copy link
Member Author

Let's close this and not merge. We are doing parallel work for orchestration for checkpoints so maybe we can do it more generally there.

@aliasaria aliasaria closed this Nov 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants