Feature: Prometheus metrics for rustic forget and rustic forget --prune #1713

@zoechi

I just learned that metrics are emitted only for rustic backup, not for rustic forget (see #1505, reply in thread).

Summary

rustic backup emits a rich set of Prometheus metrics (timing, data volume, blob counts, file/dir change stats). rustic forget and rustic forget --prune currently emit no metrics, creating an observability blind spot: you cannot alert on forgotten snapshots, track storage reclamation, or measure prune operation duration.

This issue proposes a set of metrics for both subcommands, modeled on the existing rustic_backup_* namespace.


Existing backup metrics — mapping to forget

✅ Direct equivalents (same concept, renamed)

| Backup metric | Proposed forget equivalent | Notes |
|---|---|---|
| rustic_backup_time | rustic_forget_time | Unix timestamp when operation started |
| rustic_backup_backup_start | rustic_forget_forget_start | Unix timestamp of forget phase start |
| rustic_backup_backup_end | rustic_forget_forget_end | Unix timestamp of forget phase end |
| rustic_backup_backup_duration | rustic_forget_forget_duration | Duration of forget phase (seconds) |
| rustic_backup_total_duration | rustic_forget_total_duration | Total duration incl. prune phase (seconds) |

For --prune, add a separate timing pair for the prune phase:

| Proposed metric | Notes |
|---|---|
| rustic_forget_prune_start | Unix timestamp of prune phase start |
| rustic_forget_prune_end | Unix timestamp of prune phase end |
| rustic_forget_prune_duration | Duration of prune phase only (seconds) |
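
To make the relationship between the timing metrics concrete, here is a minimal Rust sketch. The `Phase` and `ForgetTiming` types are hypothetical and not part of rustic. It assumes that rustic_forget_time equals the start of the forget phase and that rustic_forget_total_duration spans from the forget start to the end of the last phase; the actual implementation may define these differently.

```rust
/// Start and end of one phase as Unix timestamps in seconds (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Phase {
    pub start: f64,
    pub end: f64,
}

impl Phase {
    /// Duration in seconds, clamped at zero in case the clock stepped backwards.
    pub fn duration(&self) -> f64 {
        (self.end - self.start).max(0.0)
    }
}

/// Timing of one forget run; `prune` is `None` when --prune was not given.
pub struct ForgetTiming {
    pub forget: Phase,
    pub prune: Option<Phase>,
}

impl ForgetTiming {
    /// Returns (metric name, value) pairs. Prune metrics appear only if prune ran.
    pub fn samples(&self) -> Vec<(&'static str, f64)> {
        let mut out = vec![
            // Assumption: the operation starts when the forget phase starts.
            ("rustic_forget_time", self.forget.start),
            ("rustic_forget_forget_start", self.forget.start),
            ("rustic_forget_forget_end", self.forget.end),
            ("rustic_forget_forget_duration", self.forget.duration()),
        ];
        if let Some(p) = self.prune {
            out.push(("rustic_forget_prune_start", p.start));
            out.push(("rustic_forget_prune_end", p.end));
            out.push(("rustic_forget_prune_duration", p.duration()));
        }
        // Assumption: total duration runs from forget start to the end of the last phase.
        let last_end = self.prune.map_or(self.forget.end, |p| p.end);
        let total = (last_end - self.forget.start).max(0.0);
        out.push(("rustic_forget_total_duration", total));
        out
    }
}
```

Without --prune, the sketch emits only the forget timing and a total duration equal to the forget duration.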

↔️ Opposite-direction equivalents (prune only)

Backup adds data; prune removes it. These mirror the data_added family but in reverse. Only emitted when --prune is active.

| Backup metric | Proposed forget/prune equivalent | Notes |
|---|---|---|
| rustic_backup_data_added | rustic_forget_data_removed | Total bytes freed (uncompressed) |
| rustic_backup_data_added_files | rustic_forget_data_removed_files | File bytes freed (uncompressed) |
| rustic_backup_data_added_trees | rustic_forget_data_removed_trees | Tree/dir bytes freed (uncompressed) |
| rustic_backup_data_added_packed | rustic_forget_data_removed_packed | Storage bytes freed (after compression) |
| rustic_backup_data_added_files_packed | rustic_forget_data_removed_files_packed | File storage freed (compressed) |
| rustic_backup_data_added_trees_packed | rustic_forget_data_removed_trees_packed | Tree storage freed (compressed) |
| rustic_backup_data_blobs | rustic_forget_data_blobs_removed | Count of data blobs deleted |
| rustic_backup_tree_blobs | rustic_forget_tree_blobs_removed | Count of tree blobs deleted |

❌ Not applicable for forget

These backup metrics relate to filesystem scanning, which forget does not perform — no equivalents needed:

  • rustic_backup_files_new / _changed / _unmodified
  • rustic_backup_dirs_new / _changed / _unmodified
  • rustic_backup_total_files_processed
  • rustic_backup_total_dirs_processed
  • rustic_backup_total_bytes_processed
  • rustic_backup_total_dirsize_processed

New forget-specific metrics

Snapshot counts (core purpose of forget)

| Metric | Notes |
|---|---|
| rustic_forget_snapshots_total | Total snapshots in repo before forget |
| rustic_forget_snapshots_removed | Snapshots removed this run |
| rustic_forget_snapshots_kept | Snapshots kept this run |

Retention policy breakdown

Knowing why snapshots were kept helps validate policy configuration. This could be exposed either as separate metrics or as a single metric with a reason label:

rustic_forget_snapshots_kept{reason="last"}
rustic_forget_snapshots_kept{reason="hourly"}
rustic_forget_snapshots_kept{reason="daily"}
rustic_forget_snapshots_kept{reason="weekly"}
rustic_forget_snapshots_kept{reason="monthly"}
rustic_forget_snapshots_kept{reason="yearly"}
rustic_forget_snapshots_kept{reason="tag"}
rustic_forget_snapshots_kept{reason="within"}

A single labeled metric is preferable to 8 separate metrics: it is easier to query with sum by (reason) and needs only one metric name. The number of time series is the same either way (one per reason).
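
As an illustration of the labeled variant, the following Rust sketch renders the per-reason counts in the Prometheus text exposition format. The function names `render_kept` and `escape_label` are hypothetical, the gauge type is an assumption, and the sketch is independent of how rustic actually emits its metrics.

```rust
use std::fmt::Write;

/// Escapes a label value as the text exposition format requires:
/// backslash, double quote and line feed.
fn escape_label(value: &str) -> String {
    value
        .replace('\\', "\\\\")
        .replace('"', "\\\"")
        .replace('\n', "\\n")
}

/// Renders rustic_forget_snapshots_kept with one sample per retention reason.
/// `kept_by_reason` holds (reason, count) pairs, e.g. ("daily", 7).
pub fn render_kept(kept_by_reason: &[(&str, u64)]) -> String {
    let mut out = String::new();
    out.push_str("# HELP rustic_forget_snapshots_kept Snapshots kept this run, by reason\n");
    // Assumption: a gauge, since the value describes a single run.
    out.push_str("# TYPE rustic_forget_snapshots_kept gauge\n");
    for (reason, count) in kept_by_reason {
        // Writing to a String cannot fail, so the result is ignored.
        let _ = writeln!(
            out,
            "rustic_forget_snapshots_kept{{reason=\"{}\"}} {}",
            escape_label(reason),
            count
        );
    }
    out
}
```

Calling `render_kept(&[("last", 3), ("daily", 7)])` yields the two comment lines followed by one sample line per reason, all sharing a single metric name.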

Pack file metrics (prune only)

| Metric | Notes |
|---|---|
| rustic_forget_packs_removed | Pack files fully deleted (all blobs unused) |
| rustic_forget_packs_rewritten | Pack files rewritten (partial blob removal) |
| rustic_forget_packs_kept | Pack files left untouched |

Full proposed metric list

rustic forget (without --prune)

rustic_forget_time
rustic_forget_forget_start
rustic_forget_forget_end
rustic_forget_forget_duration
rustic_forget_total_duration
rustic_forget_snapshots_total
rustic_forget_snapshots_removed
rustic_forget_snapshots_kept{reason="..."}

Additional metrics for rustic forget --prune

rustic_forget_prune_start
rustic_forget_prune_end
rustic_forget_prune_duration
rustic_forget_data_removed
rustic_forget_data_removed_files
rustic_forget_data_removed_trees
rustic_forget_data_removed_packed
rustic_forget_data_removed_files_packed
rustic_forget_data_removed_trees_packed
rustic_forget_data_blobs_removed
rustic_forget_tree_blobs_removed
rustic_forget_packs_removed
rustic_forget_packs_rewritten
rustic_forget_packs_kept

Alerting use cases enabled

With these metrics, the following Prometheus alerts become possible:

  • Forget not running: time() - rustic_forget_time > 86400
  • Prune taking too long: rustic_forget_prune_duration > 3600
  • Snapshot count growing unbounded: rustic_forget_snapshots_total > threshold
  • Policy misconfiguration (no snapshots being removed): rustic_forget_snapshots_removed == 0, sustained over time (for example via the alert rule's for: clause)
  • Storage not being freed despite removals: rustic_forget_data_removed_packed == 0 and rustic_forget_snapshots_removed > 0
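
The conditions above can also be written out as plain Rust to pin down their intended semantics. This is only a sketch: the `LastRun` struct, the `firing` function and the alert names are hypothetical, and in practice these checks would live in Prometheus alerting rules. The policy misconfiguration check is left out because it needs a history of runs, not a single sample.

```rust
/// Values from the most recent forget run (hypothetical struct;
/// fields mirror the proposed metrics).
pub struct LastRun {
    pub time: f64,                        // rustic_forget_time
    pub prune_duration: Option<f64>,      // rustic_forget_prune_duration, None without --prune
    pub snapshots_total: u64,             // rustic_forget_snapshots_total
    pub snapshots_removed: u64,           // rustic_forget_snapshots_removed
    pub data_removed_packed: Option<u64>, // rustic_forget_data_removed_packed, None without --prune
}

/// Returns the names of all alerts that would fire at time `now` (Unix seconds).
pub fn firing(run: &LastRun, now: f64, snapshot_threshold: u64) -> Vec<&'static str> {
    let mut alerts = Vec::new();
    // Forget not running: last run is older than one day.
    if now - run.time > 86_400.0 {
        alerts.push("forget_not_running");
    }
    // Prune taking too long: more than one hour.
    if matches!(run.prune_duration, Some(d) if d > 3_600.0) {
        alerts.push("prune_too_long");
    }
    // Snapshot count growing unbounded.
    if run.snapshots_total > snapshot_threshold {
        alerts.push("snapshots_unbounded");
    }
    // Snapshots were removed, but prune freed no storage.
    if run.data_removed_packed == Some(0) && run.snapshots_removed > 0 {
        alerts.push("storage_not_freed");
    }
    alerts
}
```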
