[SPARK-55795][SS] Add automatic V1 to V2 offset log upgrade for streaming queries with named sources #54577

Open
ericm-db wants to merge 1 commit into apache:master from ericm-db:streaming-offset-v1-to-v2-migration

ericm-db (Contributor) commented Mar 2, 2026

What changes were proposed in this pull request?

This PR introduces an automatic offset log upgrade mechanism that allows streaming queries to migrate from V1 (positional) offset tracking to V2 (named) offset tracking when users add .name() to their streaming sources.

Key components:

  1. OffsetSeq.toOffsetMap() - Converts V1 positional offsets to V2 named offsets using provided source names

    • Validates source count matches between offset log and current plan
    • Detects and prevents duplicate source names that would cause data loss
    • Migrates V1 metadata to V2 format
  2. MicroBatchExecution upgrade logic - Orchestrates the automatic offset log upgrade

    • Only upgrades when ALL conditions are met:
      • Current offset log is V1
      • User explicitly requests V2 via spark.sql.streaming.offsetLog.formatVersion=2
      • All sources are named (not Unassigned)
      • User sets spark.sql.streaming.offsetLog.v1ToV2.autoUpgrade.enabled=true
    • Creates an "upgrade batch" that converts and commits the new offset log format
    • Fails with a clear error message if V2 is requested for a V1 offset log but the upgrade config is not set
    • Skips upgrade if uncommitted batch exists (requires clean state)
  3. Safety validations:

    • Source count mismatch detection
    • Duplicate source name detection (two layers)
    • Concurrent modification detection
    • Clean state requirement (no uncommitted batches)
  4. Comprehensive test suite - OffsetLogV1ToV2UpgradeSuite with tests for:

    • Happy path upgrade with multiple sources
    • No upgrade when sources unnamed
    • V2 offset log stability
    • Multi-source offset mapping correctness
    • Source count mismatch error handling
    • Missing upgrade config error handling
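The conversion described for OffsetSeq.toOffsetMap() can be illustrated with a small, self-contained model. This is only a sketch in plain Python, not the Scala code from the PR: the function name `to_offset_map`, the use of strings for offsets, and the exception type are all assumptions made for illustration. It shows the two validations listed above (source count match and duplicate name detection) followed by the positional-to-named mapping.

```python
from typing import Dict, List, Optional


def to_offset_map(
    positional_offsets: List[Optional[str]],
    source_names: List[str],
) -> Dict[str, Optional[str]]:
    """Hypothetical model of a V1 (positional) to V2 (named) offset conversion."""
    # The offset log and the current plan must describe the same number of sources.
    if len(positional_offsets) != len(source_names):
        raise ValueError(
            f"Source count mismatch: offset log has {len(positional_offsets)} "
            f"sources but the current plan has {len(source_names)}"
        )

    # Duplicate names would let one offset silently overwrite another.
    seen = set()
    duplicates = []
    for name in source_names:
        if name in seen and name not in duplicates:
            duplicates.append(name)
        seen.add(name)
    if duplicates:
        raise ValueError(f"Duplicate source names: {duplicates}")

    # Position i in the V1 log corresponds to the i-th named source in the plan.
    return dict(zip(source_names, positional_offsets))


if __name__ == "__main__":
    # Offsets are opaque strings here; None models a source with no offset yet.
    print(to_offset_map(["10", "20", None], ["payments", "refunds", "adjustments"]))
```

Rejecting duplicates before building the map matters because a dictionary would otherwise keep only the last offset for a repeated name, which is the data-loss case the PR description calls out.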

Why are the changes needed?

Currently, when users want to migrate from V1 (index-based) to V2 (name-based) offset tracking, they must:

  1. Delete their checkpoint directory (losing all state)
  2. Start fresh

This is problematic because:

  • State loss: All stateful operators (aggregations, joins, deduplication) lose their state
  • Data reprocessing: Query must reprocess all historical data from the beginning
  • Downtime: Requires stopping the query and careful coordination

With this change, users can safely migrate existing V1 offset logs to V2 format by:

  1. Adding .name() to all streaming sources
  2. Setting spark.sql.streaming.offsetLog.formatVersion=2
  3. Setting spark.sql.streaming.offsetLog.v1ToV2.autoUpgrade.enabled=true
  4. Restarting the query

The upgrade preserves all state and offset positions, enabling seamless transition to the more flexible V2 format that supports source evolution (adding/removing sources by name).

Does this PR introduce any user-facing change?

Yes. This PR introduces two new behaviors:

1. New config (default: false)

spark.sql.streaming.offsetLog.v1ToV2.autoUpgrade.enabled=false

When set to true, enables automatic V1 to V2 offset log upgrade when conditions are met.

2. New error message when upgrade needed but not enabled:

Previous behavior: Query would continue with V1 format (or fail with unclear error)

New behavior: Clear error message when V1 offset log exists, V2 requested, and upgrade config not set:

IllegalStateException: Offset log is in V1 format but V2 format was requested via 
spark.sql.streaming.offsetLog.formatVersion=2. To migrate the offset log, set 
spark.sql.streaming.offsetLog.v1ToV2.autoUpgrade.enabled=true. 
Important: This is a one-way migration that cannot be rolled back. 
Ensure all batches are committed before enabling. See documentation for details.

This change is backward compatible: existing V1 queries continue to work unchanged unless users explicitly opt into the upgrade.
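The opt-in behavior above can be summarized as a small decision function. The following is a hedged sketch in plain Python rather than the MicroBatchExecution logic itself: `QueryState`, `decide`, and the returned labels are hypothetical names, `RuntimeError` stands in for the `IllegalStateException` quoted above, and the order in which the conditions are checked is an assumption, since the description lists the conditions but not their precedence.

```python
from dataclasses import dataclass

# Config keys as named in the PR description.
FORMAT_VERSION_KEY = "spark.sql.streaming.offsetLog.formatVersion"
AUTO_UPGRADE_KEY = "spark.sql.streaming.offsetLog.v1ToV2.autoUpgrade.enabled"


@dataclass
class QueryState:
    """Hypothetical snapshot of the inputs to the upgrade decision."""
    log_version: int             # format found in the existing offset log
    requested_version: int       # value of the formatVersion config
    all_sources_named: bool      # every source has a name (none Unassigned)
    auto_upgrade_enabled: bool   # value of the autoUpgrade config
    has_uncommitted_batch: bool  # offset log is ahead of the commit log


def decide(state: QueryState) -> str:
    """Return "upgrade" or "continue"; raise if the opt-in config is missing."""
    # Nothing to do unless a V1 log exists and V2 was explicitly requested.
    if state.log_version != 1 or state.requested_version != 2:
        return "continue"
    # Assumption: unnamed sources keep the query on its current format.
    if not state.all_sources_named:
        return "continue"
    # V2 was requested for a V1 log, but the user has not opted in.
    if not state.auto_upgrade_enabled:
        raise RuntimeError(
            f"Offset log is in V1 format but V2 was requested via "
            f"{FORMAT_VERSION_KEY}=2. Set {AUTO_UPGRADE_KEY}=true to migrate."
        )
    # The upgrade requires a clean state, so it is skipped for now.
    if state.has_uncommitted_batch:
        return "continue"
    return "upgrade"


if __name__ == "__main__":
    print(decide(QueryState(1, 2, True, True, False)))
```

Raising instead of silently staying on V1 mirrors the stated goal of giving users explicit guidance when they request V2 without enabling the one-way migration.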

How was this patch tested?

Added a comprehensive test suite, OffsetLogV1ToV2UpgradeSuite, with the following test cases:

  1. V1 offset log + all sources named auto-upgrades to V2

    • Creates V1 offset log with unnamed sources
    • Restarts with named sources + V2 config + upgrade config
    • Verifies upgrade occurs and offsets are keyed by name
  2. V1 offset log + no sources named continues with V1

    • Verifies no upgrade when sources remain unnamed
  3. Already V2 offset log + named sources continues with V2

    • Verifies stability (no regression) for existing V2 offset logs
  4. Multi-source upgrade preserves all offsets correctly

    • Tests 3-source upgrade with names "payments", "refunds", "adjustments"
    • Verifies all offsets correctly mapped by name
  5. Source count mismatch throws clear error

    • Creates V1 offset log with 2 sources
    • Attempts upgrade with 3 sources
    • Verifies clear error message about source count mismatch
  6. V1 offset log + V2 requested without upgrade config throws clear error

    • Verifies the new error message when upgrade config not set
    • Ensures users get clear guidance on what config to set

All tests use real file-based streaming sources to ensure end-to-end correctness.

Was this patch authored or co-authored using generative AI tooling?

No.

ericm-db force-pushed the streaming-offset-v1-to-v2-migration branch 2 times, most recently from 6f02737 to 9768076, on March 2, 2026 18:04
ericm-db changed the title from "Streaming offset v1 to v2 migration" to "[SPARK-XXXXX][SS] Add automatic V1 to V2 checkpoint upgrade for streaming queries with named sources" on Mar 2, 2026
ericm-db changed the title from "[SPARK-XXXXX][SS] Add automatic V1 to V2 checkpoint upgrade for streaming queries with named sources" to "[SPARK-XXXXX][SS] Add automatic V1 to V2 offset log upgrade for streaming queries with named sources" on Mar 2, 2026
ericm-db changed the title from "[SPARK-XXXXX][SS] Add automatic V1 to V2 offset log upgrade for streaming queries with named sources" to "[SPARK-55795][SS] Add automatic V1 to V2 offset log upgrade for streaming queries with named sources" on Mar 2, 2026
Add automatic upgrade mechanism for Spark Structured Streaming offset logs, migrating from V1 (position-based) to V2 (name-based) format without losing state or requiring checkpoint deletion.

Changes:
- Add config spark.sql.streaming.offsetLog.v1ToV2.autoUpgrade.enabled
- Add SupportsOffsetLogUpgrade trait for sources to handle metadata migration
- Implement upgrade logic in MicroBatchExecution with two paths:
  - Positional: V1 to V2 with keys "0", "1", "2"
  - Named: V1 to V2 with actual source names, migrating metadata directories
- FileStreamSource implements metadata migration by copying all batches from old to new paths
- Add OffsetSeq.toOffsetMap() conversion method
- Add comprehensive test suite with 7 passing tests

Co-Authored-By: Claude <noreply@anthropic.com>
ericm-db force-pushed the streaming-offset-v1-to-v2-migration branch from ade713c to 9eab4f6 on March 3, 2026 22:49