[SPARK-55795][SS] Add automatic V1 to V2 offset log upgrade for streaming queries with named sources #54577
Open
ericm-db wants to merge 1 commit into apache:master
Add automatic upgrade mechanism for Spark Structured Streaming offset logs, migrating from V1 (position-based) to V2 (name-based) format without losing state or requiring checkpoint deletion.
Changes:
- Add config spark.sql.streaming.offsetLog.v1ToV2.autoUpgrade.enabled
- Add SupportsOffsetLogUpgrade trait for sources to handle metadata migration
- Implement upgrade logic in MicroBatchExecution with two paths:
  - Positional: V1 to V2 with keys "0", "1", "2"
  - Named: V1 to V2 with actual source names, migrating metadata directories
- FileStreamSource implements metadata migration by copying all batches from old to new paths
- Add OffsetSeq.toOffsetMap() conversion method
- Add comprehensive test suite with 7 passing tests
Co-Authored-By: Claude <noreply@anthropic.com>
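The "copying all batches from old to new paths" step in the commit message can be pictured with a small local-filesystem sketch. This is only an illustration: `MetadataCopySketch` and `copyBatches` are hypothetical names, and `java.nio.file` stands in for whatever file abstraction the patch actually uses, which the description does not show.

```scala
import java.nio.file.{Files, Path, StandardCopyOption}
import scala.jdk.CollectionConverters._

// Hypothetical sketch, not the code from this PR.
object MetadataCopySketch {
  /** Copy every regular file in oldDir into newDir; returns the copied file names, sorted. */
  def copyBatches(oldDir: Path, newDir: Path): Seq[String] = {
    Files.createDirectories(newDir)
    val stream = Files.list(oldDir)
    try {
      stream.iterator().asScala
        .filter(p => Files.isRegularFile(p))
        .map { src =>
          // Copy rather than move, so the old directory stays intact.
          Files.copy(src, newDir.resolve(src.getFileName), StandardCopyOption.REPLACE_EXISTING)
          src.getFileName.toString
        }
        .toList // force the lazy iterator before the directory stream is closed
        .sorted
    } finally stream.close()
  }
}
```

Copying instead of renaming keeps the old directory available if the upgrade has to be retried, which is one plausible reason for the copy-based approach the commit message describes.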
What changes were proposed in this pull request?
This PR introduces an automatic offset log upgrade mechanism that allows streaming queries to migrate from V1 (positional) offset tracking to V2 (named) offset tracking when users add .name() to their streaming sources.
Key components:
OffsetSeq.toOffsetMap() - Converts V1 positional offsets to V2 named offsets using provided source names
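A minimal sketch of the positional-to-named conversion may help reviewers picture it. It is not the `OffsetSeq.toOffsetMap()` implementation from this PR: the object name, the `Offset` alias, and both signatures are assumptions made for illustration, and offsets are modeled as optional strings.

```scala
// Hypothetical sketch, not the code from this PR.
object OffsetMapSketch {
  // Stand-in for a serialized source offset; None models a source with no offset yet.
  type Offset = Option[String]

  /** Pair each positional V1 offset with the name of the source at the same index. */
  def toOffsetMap(offsets: Seq[Offset], sourceNames: Seq[String]): Map[String, Offset] = {
    require(
      offsets.length == sourceNames.length,
      s"Source count mismatch: offset log has ${offsets.length} offsets, " +
        s"query has ${sourceNames.length} sources")
    require(
      sourceNames.distinct.length == sourceNames.length,
      "Source names must be unique")
    sourceNames.zip(offsets).toMap
  }

  /** Positional path: use the indices "0", "1", "2", ... as keys. */
  def positionalNames(count: Int): Seq[String] = (0 until count).map(_.toString)
}
```

For example, `OffsetMapSketch.toOffsetMap(Seq(Some("a"), None), Seq("kafka", "files"))` keys the first offset by `"kafka"` and the second by `"files"`, while a length mismatch fails fast instead of silently dropping an offset.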
MicroBatchExecution upgrade logic - Orchestrates the automatic offset log upgrade
spark.sql.streaming.offsetLog.formatVersion=2
Unassigned)
spark.sql.streaming.offsetLog.v1ToV2.autoUpgrade.enabled=true
Safety validations:
Comprehensive test suite - OffsetLogV1ToV2UpgradeSuite with tests for:
Why are the changes needed?
Currently, when users want to migrate from V1 (index-based) to V2 (name-based) offset tracking, they must:
This is problematic because:
With this change, users can safely migrate existing V1 offset logs to V2 format by:
Adding .name() to all streaming sources
Setting spark.sql.streaming.offsetLog.formatVersion=2
Setting spark.sql.streaming.offsetLog.v1ToV2.autoUpgrade.enabled=true
The upgrade preserves all state and offset positions, enabling a seamless transition to the more flexible V2 format, which supports source evolution (adding or removing sources by name).
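The two settings above could be collected as in the following sketch. The keys and values come from this description; `UpgradeConfSketch` and `asSubmitArgs` are hypothetical helpers that only show one way to hand the settings to a job.

```scala
// Hypothetical helper; only the config keys and values come from the PR description.
object UpgradeConfSketch {
  val upgradeConf: Map[String, String] = Map(
    "spark.sql.streaming.offsetLog.formatVersion" -> "2",
    "spark.sql.streaming.offsetLog.v1ToV2.autoUpgrade.enabled" -> "true"
  )

  /** Render the settings as spark-submit style `--conf key=value` arguments, sorted by key. */
  def asSubmitArgs(conf: Map[String, String]): Seq[String] =
    conf.toSeq.sortBy(_._1).flatMap { case (k, v) => Seq("--conf", s"$k=$v") }
}
```

The same pairs could instead be passed one by one to `SparkSession.builder().config(key, value)` before the streaming query is started.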
Does this PR introduce any user-facing change?
Yes. This PR introduces two new behaviors:
1. New config spark.sql.streaming.offsetLog.v1ToV2.autoUpgrade.enabled (default: false)
When set to true, enables automatic V1 to V2 offset log upgrade when conditions are met.
2. New error message when upgrade needed but not enabled:
Previous behavior: Query would continue with V1 format (or fail with unclear error)
New behavior: Clear error message when V1 offset log exists, V2 requested, and upgrade config not set:
This is a backward-compatible change: existing V1 queries continue working unchanged unless users explicitly opt into the upgrade.
How was this patch tested?
Added comprehensive test suite OffsetLogV1ToV2UpgradeSuite with the following test cases:
V1 offset log + all sources named auto-upgrades to V2
V1 offset log + no sources named continues with V1
Already V2 offset log + named sources continues with V2
Multi-source upgrade preserves all offsets correctly
Source count mismatch throws clear error
V1 offset log + V2 requested without upgrade config throws clear error
All tests use real file-based streaming sources to ensure end-to-end correctness.
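The test cases above can be read as a small decision table. The sketch below is one possible encoding, not the logic in `MicroBatchExecution`: every name is hypothetical, the failure reasons are placeholders rather than the real error messages, and the ordering of the checks (for example, that a missing upgrade config is reported before source naming is considered) is an assumption.

```scala
// Hypothetical sketch, not the code from this PR.
object UpgradeDecisionSketch {
  sealed trait Decision
  case object ContinueV1 extends Decision
  case object ContinueV2 extends Decision
  case object UpgradeToV2 extends Decision
  final case class Fail(reason: String) extends Decision

  def decide(
      existingLogVersion: Int,
      requestedVersion: Int,
      allSourcesNamed: Boolean,
      autoUpgradeEnabled: Boolean,
      loggedSourceCount: Int,
      querySourceCount: Int): Decision = {
    if (existingLogVersion == 2) ContinueV2        // already V2: nothing to upgrade
    else if (requestedVersion != 2) ContinueV1     // V2 not requested: stay on V1
    else if (!autoUpgradeEnabled)
      Fail("V1 offset log found, V2 requested, auto-upgrade not enabled")
    else if (!allSourcesNamed) ContinueV1          // assumption: unnamed sources keep V1
    else if (loggedSourceCount != querySourceCount)
      Fail(s"Source count mismatch: $loggedSourceCount logged, $querySourceCount in query")
    else UpgradeToV2
  }
}
```

Under these assumptions, `decide(1, 2, allSourcesNamed = true, autoUpgradeEnabled = true, 2, 2)` yields `UpgradeToV2`, and the same call with `autoUpgradeEnabled = false` yields a `Fail`.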
Was this patch authored or co-authored using generative AI tooling?
No.