
[SPARK-55820] [SS] Fix race condition in no-overwrite FS when RocksDB version files cache is out of sync #54602

Open
liviazhu wants to merge 1 commit into apache:master from liviazhu:liviazhu-db/rocksdb-cleanup-fix

liviazhu commented Mar 3, 2026

What changes were proposed in this pull request?

The following race condition can occur on no-overwrite filesystems when minVersionsToRetain <= minDeltasForSnapshot:

  1. Query run 1 uploads snapshot X.zip, which points to SST file Y.SST.
  2. Query run 1 is cancelled before the commit log is written.
  3. Query run 2 retries the batch, uploads Z.SST, and tries to re-upload X.zip, now pointing to Z.SST. On a no-overwrite FS the zip overwrite silently fails, so X.zip on DFS still references Y.SST, but versionToRocksDBFiles now maps version X -> Z.SST (stale).
  4. Maintenance/cleanup uses the stale in-memory mapping, sees Y.SST as untracked, and deletes it.
  5. A subsequent query run loads X.zip from cloud storage, which still references the deleted Y.SST, and fails with FileNotFoundException.
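
The five steps above can be reproduced with a toy model. This is only an illustrative sketch in Python, not the Spark code: `NoOverwriteFS`, `upload_snapshot`, `cleanup_using_cache`, `load_snapshot`, and `simulate_race` are hypothetical names, the filesystem is a dict where the first write to a path wins, and a snapshot zip is modelled as a list of the SST names it references.

```python
class NoOverwriteFS:
    """Toy stand-in for a no-overwrite filesystem: the first write to a path wins."""

    def __init__(self):
        self.files = {}

    def write(self, path, content):
        # A second write to an existing path is silently ignored.
        self.files.setdefault(path, content)

    def delete(self, path):
        self.files.pop(path, None)

    def read(self, path):
        if path not in self.files:
            raise FileNotFoundError(path)
        return self.files[path]


def upload_snapshot(fs, cache, version, sst_name):
    """Upload an SST file and a snapshot zip, then update the in-memory cache."""
    fs.write(sst_name, b"sst-data")
    fs.write(f"{version}.zip", [sst_name])  # no-op if the zip already exists
    cache[version] = [sst_name]             # cache is updated regardless


def cleanup_using_cache(fs, cache):
    """Delete every SST file that the in-memory cache does not track."""
    tracked = {name for names in cache.values() for name in names}
    for path in list(fs.files):
        if path.endswith(".sst") and path not in tracked:
            fs.delete(path)


def load_snapshot(fs, version):
    """Load a snapshot by reading its zip and every SST file it references."""
    for sst_name in fs.read(f"{version}.zip"):
        fs.read(sst_name)


def simulate_race():
    """Replay steps 1-5 and return the name of the file that is missing."""
    fs, cache = NoOverwriteFS(), {}
    upload_snapshot(fs, cache, "X", "Y.sst")  # steps 1-2: run 1, then cancelled
    upload_snapshot(fs, cache, "X", "Z.sst")  # step 3: retry, zip overwrite ignored
    cleanup_using_cache(fs, cache)            # step 4: Y.sst looks untracked
    try:
        load_snapshot(fs, "X")                # step 5: X.zip still references Y.sst
    except FileNotFoundError as e:
        return e.args[0]
    return None
```

In this model, `simulate_race()` returns the name of the SST file that the stored `X.zip` references but cleanup has already removed.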

This change fixes the race condition by having cleanup open the minimum retained version on DFS and preserve the files referenced there, rather than relying on the cache.
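
A minimal sketch of that idea, again in Python and under stated assumptions rather than as the patch's actual implementation: `files_safe_to_delete` is a hypothetical helper, `dfs` is a dict standing in for the remote filesystem, and a `<version>.zip` entry is modelled as the list of SST names the stored snapshot references.

```python
def files_safe_to_delete(dfs, cache, min_retained_version):
    """Return the SST names that cleanup may delete.

    dfs: dict mapping path -> content; "<version>.zip" holds a list of SST names.
    cache: in-memory dict mapping version -> list of SST names (may be stale).
    """
    # Files tracked by the in-memory cache for retained versions.
    tracked = {
        name
        for version, names in cache.items()
        if version >= min_retained_version
        for name in names
    }
    # Also protect whatever the snapshot stored on DFS for the minimum
    # retained version references, even if the cache disagrees with it.
    zip_path = f"{min_retained_version}.zip"
    if zip_path in dfs:
        tracked.update(dfs[zip_path])
    return sorted(p for p in dfs if p.endswith(".sst") and p not in tracked)
```

With `dfs = {"5.zip": ["Y.sst"], "Y.sst": b"", "Z.sst": b""}` and a stale `cache = {5: ["Z.sst"]}`, neither SST file is eligible for deletion, because `Y.sst` is protected by the zip read from DFS and `Z.sst` by the cache.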

Why are the changes needed?

Fixes a race condition that leads to a FileNotFoundException.

Does this PR introduce any user-facing change?

No

How was this patch tested?

New unit test

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude 4.5
