
[SPARK-55817][SQL] Enable Parquet row-group skipping for shredded variant #54598

Open
qlong wants to merge 1 commit into apache:master from qlong:SPARK-55817-row-group-skipping


qlong commented Mar 3, 2026

What changes were proposed in this pull request?

When PushVariantIntoScan rewrites variant_get() calls into struct field accesses, the rewritten predicates reference logical paths such as "v.`0`". ParquetFilters cannot resolve these paths to any physical column, so the predicates are dropped and row-group skipping is disabled for all shredded variant queries.

This change adds variantExtractionSchema to ParquetFilters and resolves each logical path to the corresponding typed_value leaf in the physical Parquet schema. The resolved entries allow predicates on shredded variants to participate in row-group skipping.

Array-index paths and fields absent from a file's physical schema are skipped.
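
The resolution step can be pictured with a small standalone model. This is only a sketch, not the code in this patch: `EXTRACTION_SCHEMA`, `PHYSICAL_LEAVES`, and `resolve` are hypothetical names, the example paths are made up, and field names containing dots are not handled. It assumes the standard shredded layout in which each object field is a group with `value` and `typed_value` children.

```python
from typing import Optional

# Hypothetical stand-in for variantExtractionSchema: for each variant column,
# the ordinal field name produced by the rewrite maps to the variant path
# that was requested through variant_get().
EXTRACTION_SCHEMA = {
    "v": {"0": "$.a", "1": "$.b.c", "2": "$.items[0]"},
}

# Hypothetical leaf columns present in one file's physical schema.
PHYSICAL_LEAVES = {
    "v.metadata",
    "v.value",
    "v.typed_value.a.value",
    "v.typed_value.a.typed_value",
}


def resolve(logical_path: str) -> Optional[str]:
    """Map a logical path such as "v.`0`" to a typed_value leaf, or None."""
    column, _, ordinal = logical_path.partition(".")
    ordinal = ordinal.strip("`")
    variant_path = EXTRACTION_SCHEMA.get(column, {}).get(ordinal)
    if variant_path is None or not variant_path.startswith("$."):
        return None  # unknown column or ordinal: nothing to resolve
    if "[" in variant_path:
        return None  # array-index paths are skipped
    keys = variant_path[2:].split(".")
    # Each object key adds one ".typed_value.<key>" level; the leaf holding
    # the shredded values is the final "typed_value".
    physical = column + "".join(f".typed_value.{k}" for k in keys) + ".typed_value"
    # Fields absent from this file's physical schema are skipped as well.
    return physical if physical in PHYSICAL_LEAVES else None


print(resolve("v.`0`"))  # shredded field present in the file
print(resolve("v.`1`"))  # field not shredded in this file
print(resolve("v.`2`"))  # array-index path
```

In this model only `v.`0`` resolves to a leaf, so only a predicate on that path would be eligible for row-group skipping; the other two fall back to ordinary evaluation after the scan.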

Jira: https://issues.apache.org/jira/browse/SPARK-55817

Why are the changes needed?

Performance improvement. Predicates on shredded variants are now pushed down and participate in row-group filtering.
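
For readers unfamiliar with row-group filtering, the general idea is sketched below. This is a generic illustration of min/max statistics pruning, not Spark's or Parquet's actual implementation; `RowGroupStats` and `row_groups_to_read` are hypothetical names and the statistics are invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroupStats:
    """Hypothetical per-row-group statistics for one resolved leaf column."""
    min_value: int
    max_value: int


def row_groups_to_read(groups: List[RowGroupStats], threshold: int) -> List[int]:
    """Return the indexes of row groups that may satisfy `col > threshold`.

    A row group whose maximum is not above the threshold cannot contain a
    matching row, so it can be skipped without being decoded.
    """
    return [i for i, g in enumerate(groups) if g.max_value > threshold]


groups = [RowGroupStats(0, 9), RowGroupStats(10, 19), RowGroupStats(20, 29)]
print(row_groups_to_read(groups, 15))  # [1, 2]
```

A predicate can take part in this kind of pruning only if it names a physical column that carries statistics, which is why the logical path has to be resolved first.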

Does this PR introduce any user-facing change?

No.

How was this patch tested?

  • Unit tests for resolving pushed-down shredded variant predicates from logical paths to physical columns.
  • Tests verifying that row groups are skipped with Parquet filters.

Was this patch authored or co-authored using generative AI tooling?

Co-authored with Claude 4.6 Sonnet.

qlong force-pushed the SPARK-55817-row-group-skipping branch from 079b01f to ef8bc2f on March 3, 2026 21:41

qlong commented Mar 4, 2026

@chenhao-db Can you take a look at this? It is a follow-up to your PR #49235.
