Add checkpointing support for Tinker SkyRL backend#992

Merged
tyler-griggs merged 4 commits into main from tyler/tinker-checkpointing-main
Jan 30, 2026
Conversation


@tyler-griggs tyler-griggs commented Jan 30, 2026

Summary

Implements checkpointing for the SkyRL-Train backend in Tinker, enabling periodic checkpoint saves and training resumption.

Changes

Adds three checkpoint methods to SkyRLTrainBackend:

  • save_checkpoint() - Saves the full training state (model + optimizer + scheduler) as a tar archive
  • load_checkpoint() - Restores the full training state from a checkpoint
  • save_sampler_checkpoint() - Exports a model-only checkpoint in HuggingFace format
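As a rough sketch of this surface area (the class skeleton, the `dispatch` attribute, and the `save_hf_model` call are illustrative assumptions, not the actual SkyRL code):

```python
from dataclasses import dataclass


@dataclass
class SkyRLTrainBackend:
    """Hypothetical skeleton of the three checkpoint entry points."""

    dispatch: object  # stands in for a WorkerDispatch-like object

    def save_checkpoint(self, path: str) -> None:
        # Full state: model + optimizer + scheduler, packaged as a tar.
        self.dispatch.save_checkpoint(path)

    def load_checkpoint(self, path: str) -> None:
        # Restore full training state from a previously saved archive.
        self.dispatch.load_checkpoint(path)

    def save_sampler_checkpoint(self, path: str) -> None:
        # Model-only export in HuggingFace format for inference.
        self.dispatch.save_hf_model(path)
```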

Implementation

  • Wraps WorkerDispatch.save_checkpoint() and load_checkpoint() methods
  • Packages FSDP-sharded checkpoints into tar archives for storage
  • Uses uncompressed tar (not gzip) to avoid 5-10 minute compression overhead on 6-7GB checkpoints
  • Preserves optimizer state (Adam momentum/variance) for seamless training resumption
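A minimal sketch of the tar-packaging step, assuming the FSDP workers have already written a checkpoint directory (the function name and paths are illustrative):

```python
import os
import tarfile


def package_checkpoint(checkpoint_dir: str, tar_path: str) -> None:
    """Bundle a sharded checkpoint directory into a single tar archive.

    Mode "w" writes an uncompressed tar; "w:gz" would add minutes of
    gzip time on multi-gigabyte checkpoints for little size benefit,
    since tensor data compresses poorly.
    """
    with tarfile.open(tar_path, "w") as tar:
        tar.add(checkpoint_dir, arcname=os.path.basename(checkpoint_dir))
```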

Add full checkpoint save/load functionality to SkyRLTrainBackend:

- save_checkpoint(): Saves model + optimizer + scheduler state as uncompressed tar
- load_checkpoint(): Restores full training state from tar checkpoint
- save_sampler_checkpoint(): Exports model weights in HuggingFace format for inference

Implementation wraps WorkerDispatch checkpoint methods and handles tar packaging.
Uses uncompressed tar to avoid 5-10 minute gzip bottleneck on 6-7GB FSDP checkpoints.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces checkpointing capabilities to the SkyRLTrainBackend, enabling training state to be saved and restored via tar archives. However, a high-severity path traversal vulnerability was identified in the checkpoint loading logic due to the unsafe use of tarfile.extractall(). Additionally, the worker initialization logic uses trust_remote_code=True when loading model configurations, which poses a significant security risk if untrusted model identifiers are processed. Beyond security, there's also a minor race condition and some code duplication that could benefit from refactoring for improved maintainability.

Address PR review feedback:

1. Security: Add filter='data' to tarfile.extractall() to prevent path
   traversal (TarSlip) attacks where malicious archives could write
   outside the temp directory

2. Refactor: Extract duplicate validation logic into _validate_model_state()
   helper method (used by all 3 checkpoint methods)

3. Remove redundant os.path.exists() check that creates TOCTOU race
   condition - tarfile.open() already raises FileNotFoundError

4. Refactor: Extract common tar creation logic into
   _create_tar_from_directory() helper method to reduce duplication

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@tyler-griggs tyler-griggs merged commit 130d16b into main Jan 30, 2026
5 of 6 checks passed
