@deepikarajani24

Summary

This PR updates the TensorFlow and Orbax dependencies to support the new HNS-native RenameFolder API.

Changes

To use this feature on hierarchical namespace (HNS) buckets, configure ocp.CheckpointManagerOptions with todelete_full_path="_trash".

  • Impact: When max_to_keep is exceeded, old checkpoints are now atomically moved to a _trash subdirectory instead of being deleted.

Context & Motivation

  • TensorFlow Support: TensorFlow has added support for the HNS RenameFolder API, allowing for recursive, atomic directory moves.
  • Orbax Integration: Orbax now exposes a todelete_full_path option in CheckpointManagerOptions. When enabled, Orbax delegates to tf.io.gfile.rename to move old checkpoints to a trash directory rather than performing a slow, object-by-object deletion.
  • Performance: On HNS buckets, renaming a folder is significantly faster than standard deletion.
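
The rotation behavior described above can be sketched on a local filesystem. This is only an illustrative stand-in, not Orbax's implementation: the function name rotate_checkpoints and the "step_<n>" directory naming are hypothetical, and os.rename plays the role that tf.io.gfile.rename plays on an HNS bucket. The point is that each stale checkpoint is retired with a single directory rename rather than a walk that deletes every file inside it.

```python
import os
import tempfile


def rotate_checkpoints(root: str, max_to_keep: int, trash_name: str = "_trash") -> list[str]:
    """Move all but the newest `max_to_keep` step directories into root/trash_name.

    Local-filesystem stand-in: each move is one os.rename on the directory,
    not a per-file deletion.
    """
    trash = os.path.join(root, trash_name)
    os.makedirs(trash, exist_ok=True)
    # Hypothetical layout: step directories are named "step_<integer>".
    steps = sorted(
        (d for d in os.listdir(root) if d.startswith("step_")),
        key=lambda d: int(d.split("_", 1)[1]),
    )
    # Everything except the newest `max_to_keep` steps is stale.
    stale = steps[:-max_to_keep] if max_to_keep > 0 else steps
    for name in stale:
        os.rename(os.path.join(root, name), os.path.join(trash, name))
    return stale


def demo() -> tuple[list[str], list[str], list[str]]:
    """Create four fake checkpoints, keep the newest two, and report the result."""
    with tempfile.TemporaryDirectory() as root:
        for step in (100, 200, 300, 400):
            step_dir = os.path.join(root, f"step_{step}")
            os.makedirs(step_dir)
            with open(os.path.join(step_dir, "weights.bin"), "wb") as f:
                f.write(b"\x00")
        moved = rotate_checkpoints(root, max_to_keep=2)
        kept = sorted(d for d in os.listdir(root) if d.startswith("step_"))
        trashed = sorted(os.listdir(os.path.join(root, "_trash")))
        return moved, kept, trashed
```

In this sketch the trashed directories still occupy storage until something removes them, so a real deployment would presumably pair the rename with a separate cleanup of the trash directory.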

Validation

Scale tests on AXLearn workloads with this configuration confirmed that rename operations were significantly faster than the previous deletion mechanism, reducing overhead during checkpoint rotation.

Configuration Snippet

options=ocp.CheckpointManagerOptions(
    create=True,
    max_to_keep=cfg.keep_last_n,
    enable_async_checkpointing=True,
    step_name_format=self._name_format,
    should_save_fn=save_fn_with_summaries,
    enable_background_delete=True,
    async_options=ocp.options.AsyncOptions(timeout_secs=cfg.async_timeout_secs),
    # New HNS optimization:
    todelete_full_path="_trash",
)

@deepikarajani24 deepikarajani24 requested a review from a team as a code owner November 22, 2025 01:49
@deepikarajani24 deepikarajani24 marked this pull request as draft November 22, 2025 01:50