[SPARK-54907][SS] Introduce NameStreamingSources analyzer rule for streaming source evolution #53684
JIRA Issue Information: Task SPARK-54907 (this comment was automatically generated by GitHub Actions)
2515f71 to 7dd09f9
7dd09f9 to 22baf62
@dtenedor PTAL
dtenedor left a comment
We talked offline about a couple small mismatches between config names in error messages, etc. LGTM now.
LGTM, merging to master.
…reaming source evolution

## What changes were proposed in this pull request?

This PR introduces the `NameStreamingSources` analyzer rule and supporting infrastructure to enable streaming source evolution. This allows streaming queries to add, remove, or reorder sources without losing state by assigning stable names to sources.

Key changes:
- Added `HasStreamingSourceIdentifyingName` trait for uniform name propagation
- Updated `StreamingRelationV2` to support source identifying names
- Created `NameStreamingSources` analyzer rule to propagate names from `NamedStreamingRelation` wrappers
- Added `spark.sql.streaming.queryEvolution.enableStreamingSourceEvolution` config flag
- Added error handling for unnamed sources when enforcement is enabled

## Why are the changes needed?

Currently, streaming sources are identified by their position in the query plan (sources/0, sources/1, etc.). This makes it impossible to add, remove, or reorder sources without breaking checkpoint compatibility.

By assigning stable names to sources, we enable:
1. **Source evolution**: Add/remove/reorder sources without losing state
2. **Stable checkpoint locations**: sources/<name> instead of sources/0, sources/1
3. **Better debugging**: Named sources are easier to identify and debug

## Does this PR introduce _any_ user-facing change?

No. The infrastructure is in place but the user-facing `.name()` DataFrame API is not yet exposed. The analyzer rule handles existing `NamedStreamingRelation` nodes that may be created internally.

## How was this patch tested?

- Added comprehensive unit tests in `NameStreamingSourcesSuite` (15 test cases)
- Tests cover name propagation, enforcement checks, error messages, and edge cases
- Tests verify behavior with UserProvided, FlowAssigned, and Unassigned names

## Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#53684 from ericm-db/SPARK-54684-streaming-source-naming.
Lead-authored-by: Eric Marnadi <132308037+ericm-db@users.noreply.github.com>
Co-authored-by: ericm-db <eric.marnadi@databricks.com>
Signed-off-by: Daniel Tenedorio <daniel.tenedorio@databricks.com>
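
For readers who want to see why positional identification breaks checkpoint compatibility, here is a minimal, Spark-free sketch in Python. It is not the code from this PR: the `Source` class and the `positional_paths` and `named_paths` helpers are hypothetical names invented for illustration, and the fallback to a positional key when enforcement is off is an assumption. Only the `sources/0` versus `sources/<name>` layout and the error for unnamed sources under enforcement come from the description.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Source:
    """Hypothetical stand-in for a streaming source in a query plan."""
    description: str
    name: Optional[str] = None  # stable identifying name, if one was assigned


def positional_paths(sources: List[Source]) -> Dict[str, str]:
    """Map checkpoint keys to sources by plan position: sources/0, sources/1, ..."""
    return {f"sources/{i}": s.description for i, s in enumerate(sources)}


def named_paths(sources: List[Source], enforce: bool = True) -> Dict[str, str]:
    """Map checkpoint keys to sources by stable name: sources/<name>.

    With enforce=True an unnamed source is an error. With enforce=False this
    sketch falls back to the positional key (an assumption, not confirmed
    behavior of the PR).
    """
    paths: Dict[str, str] = {}
    for i, s in enumerate(sources):
        if s.name is None:
            if enforce:
                raise ValueError(f"unnamed streaming source: {s.description}")
            key = f"sources/{i}"
        else:
            key = f"sources/{s.name}"
        paths[key] = s.description
    return paths


orders = Source("kafka:orders", name="orders")
events = Source("files:/data/events", name="events")

before = [orders, events]
after = [events, orders]  # same sources, reordered in the query

# Positional keys now point at different sources after the reorder.
print(positional_paths(before) == positional_paths(after))  # False

# Named keys are unaffected by the reorder.
print(named_paths(before) == named_paths(after))  # True
```

In this model, reordering swaps which source `sources/0` refers to, so previously saved state would be attributed to the wrong source, while the name-keyed mapping is identical before and after the reorder.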