Add checkpointing support for Tinker SkyRL backend#992

Merged
tyler-griggs merged 4 commits into main from tyler/tinker-checkpointing-main
Jan 30, 2026
Conversation


@tyler-griggs tyler-griggs commented Jan 30, 2026

Summary

Implements checkpointing for the SkyRL-Train backend in Tinker, enabling periodic checkpoint saves and training resumption.

Changes

Adds three checkpoint methods to SkyRLTrainBackend:

  • save_checkpoint() - Saves the full training state (model + optimizer + scheduler) as a tar archive
  • load_checkpoint() - Restores the full training state from a checkpoint
  • save_sampler_checkpoint() - Exports a model-only checkpoint in HuggingFace format
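As a rough sketch of this surface area (the class skeleton, the `dispatch` attribute, and the `save_hf_model` call are illustrative assumptions, not the actual SkyRL code):

```python
from dataclasses import dataclass


@dataclass
class SkyRLTrainBackend:
    """Hypothetical skeleton of the three checkpoint entry points."""

    dispatch: object  # stands in for a WorkerDispatch-like object

    def save_checkpoint(self, path: str) -> None:
        # Full state: model + optimizer + scheduler, packaged as a tar.
        self.dispatch.save_checkpoint(path)

    def load_checkpoint(self, path: str) -> None:
        # Restore full training state from a previously saved archive.
        self.dispatch.load_checkpoint(path)

    def save_sampler_checkpoint(self, path: str) -> None:
        # Model-only export in HuggingFace format for inference.
        self.dispatch.save_hf_model(path)
```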

Implementation

  • Wraps WorkerDispatch.save_checkpoint() and load_checkpoint() methods
  • Packages FSDP-sharded checkpoints into tar archives for storage
  • Uses uncompressed tar (not gzip) to avoid 5-10 minute compression overhead on 6-7GB checkpoints
  • Preserves optimizer state (Adam momentum/variance) for seamless training resumption
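A minimal sketch of the tar-packaging step, assuming the FSDP workers have already written a checkpoint directory (the function name and paths are illustrative):

```python
import os
import tarfile


def package_checkpoint(checkpoint_dir: str, tar_path: str) -> None:
    """Bundle a sharded checkpoint directory into a single tar archive.

    Mode "w" writes an uncompressed tar; "w:gz" would add minutes of
    gzip time on multi-gigabyte checkpoints for little size benefit,
    since tensor data compresses poorly.
    """
    with tarfile.open(tar_path, "w") as tar:
        tar.add(checkpoint_dir, arcname=os.path.basename(checkpoint_dir))
```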

Add full checkpoint save/load functionality to SkyRLTrainBackend:

- save_checkpoint(): Saves model + optimizer + scheduler state as uncompressed tar
- load_checkpoint(): Restores full training state from tar checkpoint
- save_sampler_checkpoint(): Exports model weights in HuggingFace format for inference

Implementation wraps WorkerDispatch checkpoint methods and handles tar packaging.
Uses uncompressed tar to avoid 5-10 minute gzip bottleneck on 6-7GB FSDP checkpoints.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces checkpointing capabilities to the SkyRLTrainBackend, enabling training state to be saved and restored via tar archives. However, a high-severity path traversal vulnerability was identified in the checkpoint loading logic due to the unsafe use of tarfile.extractall(). Additionally, the worker initialization logic uses trust_remote_code=True when loading model configurations, which poses a significant security risk if untrusted model identifiers are processed. Beyond security, there's also a minor race condition and some code duplication that could benefit from refactoring for improved maintainability.

Address PR review feedback:

1. Security: Add filter='data' to tarfile.extractall() to prevent path
   traversal (TarSlip) attacks where malicious archives could write
   outside the temp directory

2. Refactor: Extract duplicate validation logic into _validate_model_state()
   helper method (used by all 3 checkpoint methods)

3. Remove redundant os.path.exists() check that creates TOCTOU race
   condition - tarfile.open() already raises FileNotFoundError

4. Refactor: Extract common tar creation logic into
   _create_tar_from_directory() helper method to reduce duplication

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@tyler-griggs tyler-griggs merged commit 130d16b into main Jan 30, 2026
5 of 6 checks passed
