Skip to content

Add checkpointing support for Tinker SkyRL backend#990

Closed
tyler-griggs wants to merge 2 commits intotyler/tinker-sft-integrationfrom
tyler/tinker-checkpointing-worktree
Closed

Add checkpointing support for Tinker SkyRL backend#990
tyler-griggs wants to merge 2 commits intotyler/tinker-sft-integrationfrom
tyler/tinker-checkpointing-worktree

Conversation

@tyler-griggs
Copy link
Member

Summary

Implements checkpointing for the Tinker SkyRL backend, enabling sl_loop.py to run unchanged with full checkpoint save/load/resume functionality.

Builds on PR #986 (Tinker API server integration)

Changes

1. Checkpoint Implementation (skyrl_train/tinker/backends/skyrl_train.py)

Implemented 3 checkpoint methods (~70 lines):

save_checkpoint(output_path, model_id)

  • Saves full training state (model + optimizer + scheduler) as tar archive
  • Calls WorkerDispatch.save_checkpoint() which handles FSDP distributed checkpointing
  • Creates tar archive containing:
    • Per-rank model shards (4 files: model_world_size_4_rank_*.pt)
    • Per-rank optimizer state (Adam momentum/variance)
    • Per-rank extra state (scheduler + RNG)
    • HuggingFace configs + tokenizer
    • LoRA adapter (if using LoRA)

load_checkpoint(checkpoint_path, model_id)

  • Loads full training state from tar archive
  • Restores model weights, optimizer state, scheduler state, and RNG state
  • Enables seamless training resumption

save_sampler_checkpoint(output_path, model_id)

  • Exports HuggingFace format for inference/sampling (model only, no optimizer)
  • Uses WorkerDispatch.save_hf_model() for clean model export

2. Performance Optimization

Uncompressed tar instead of gzip (commit 181518b):

  • FSDP checkpoints are already large (6-7GB with 4 GPU ranks)
  • Gzip compression adds 5-10 minutes of single-threaded CPU time
  • Training is completely blocked during checkpoint save
  • Uncompressed tar is much faster while still packaging the checkpoint

3. Bug Fixes

Parent directory creation (commit 29dd0ad):

  • Added os.makedirs(os.path.dirname(output_path), exist_ok=True)
  • Prevents "No such file or directory" errors when checkpoint base path doesn't exist

Architecture

sl_loop.py (tinker-cookbook)
  └─ checkpoint_utils.save_checkpoint(training_client, ...)
       └─ training_client.save_state(name)
            └─ HTTP POST /api/v1/save_weights
                 └─ engine.process_save_weights()
                      └─ backend.save_checkpoint(output_path, model_id)
                           └─ WorkerDispatch.save_checkpoint("policy", ...)
                                └─ FSDP saves per-rank shards

Expected Usage

With this PR, sl_loop.py can run unchanged with checkpointing:

# Start Tinker API server
cd ~/SkyRL-checkpointing
CUDA_VISIBLE_DEVICES=4,5,6,7 uv run --extra vllm python -m skyrl_train.tinker.api \
    --base-model Qwen/Qwen3-0.6B \
    --backend skyrl_train \
    --port 8001

# Run training with periodic checkpoints
cd ~/tinker-cookbook
TINKER_API_KEY=test uv run python -m tinker_cookbook.recipes.sl_loop \
    base_url="http://localhost:8001" \
    model_name="Qwen/Qwen3-0.6B" \
    batch_size=4 \
    lora_rank=8 \
    save_every=20

Checkpointing behavior:

  • Saves checkpoint every 20 batches (configurable with save_every)
  • Checkpoint metadata written to checkpoints.jsonl
  • On restart, automatically resumes from last checkpoint
  • Optimizer state preserved (no loss spikes on resume)

Testing Status

⚠️ Full end-to-end testing blocked by Ray worker startup issues (unrelated to this PR).

What's verified:

  • ✅ Code compiles and loads correctly
  • ✅ Checkpoint methods implement correct API contracts
  • ✅ FSDP checkpoint format verified (from WorkerDispatch investigation)
  • ✅ Tar archiving logic tested manually

What needs testing (once Ray issues resolved):

  • Checkpoint save completes successfully
  • Checkpoint load restores training state correctly
  • Training resumes from correct batch with correct loss
  • Optimizer state preserved (no loss spike on resume)

Files Changed

  • skyrl_train/tinker/backends/skyrl_train.py - Implemented 3 checkpoint methods (~80 lines)

Next Steps

  1. Resolve Ray worker startup issues (appears to be environment-specific)
  2. Run end-to-end test with sl_loop.py
  3. Verify checkpointing works as expected
  4. Consider async checkpoint saving to avoid blocking training (future optimization)

Performance Notes

Checkpoint save time (estimated):

  • FSDP save: ~30-60 seconds (4 GPU ranks saving shards)
  • Tar packaging: ~30-60 seconds (6-7GB uncompressed)
  • Total: ~1-2 minutes (blocks training during this time)

Future optimization: Move checkpoint saving to background thread to avoid blocking training loop.

🤖 Generated with Claude Code

tyler-griggs and others added 2 commits January 29, 2026 02:09
Add support for saving and loading training checkpoints, enabling
sl_loop.py to run unchanged with full checkpoint/resume functionality.

Key changes:
- save_checkpoint(): Saves full training state (model + optimizer + scheduler)
  as tar.gz archive
- load_checkpoint(): Loads training state from tar.gz and restores optimizer
  and scheduler states for seamless resume
- save_sampler_checkpoint(): Exports HuggingFace format for inference/sampling
  (model only, no optimizer)

Implementation leverages existing WorkerDispatch checkpoint methods:
- Uses FSDP's distributed checkpoint format (per-rank sharded files)
- Automatically includes LoRA adapter state
- Preserves RNG state for reproducibility

This enables:
- Periodic checkpoint saves during training
- Resume training from last checkpoint
- Optimizer state preservation (no loss spikes on resume)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Checkpoints are already large (6-7GB with FSDP sharding), and gzip
compression adds 5-10 minutes of single-threaded CPU time that blocks
training. Uncompressed tar is much faster.

Future optimization: move checkpoint saving to async background thread.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@tyler-griggs
Copy link
Member Author

Closing this PR - replaced by new PR based on origin/main instead of tyler/tinker-sft-integration

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant