fix: Add schema validation for native_datafusion Parquet scan #3902

Open
vaibhawvipul wants to merge 2 commits into apache:main from vaibhawvipul:vipul-issue-3720
@vaibhawvipul
Contributor

Which issue does this PR close?

Closes #3720.

Rationale for this change

The native_datafusion Parquet scan path silently produces wrong results, or fails to raise an error, when the read schema is incompatible with the actual file schema. Spark's vectorized reader throws SchemaColumnConvertNotSupportedException in these cases (e.g., reading binary as timestamp, reading a scalar as an array, or a decimal precision mismatch), but Comet's native scan bypasses these checks entirely.

What changes are included in this PR?

Adds per-file Parquet schema validation to both scan paths (CometNativeScanExec and CometScanExec): each file's footer is read and its columns are checked against the read schema.
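
For readers unfamiliar with this kind of check, here is a minimal, language-neutral sketch in Python of what per-file validation could look like. It is not the code in this PR. `ColumnType`, `SchemaMismatchError`, `check_column`, and `validate_files` are hypothetical names, the file schemas are passed in as plain dictionaries rather than read from a real Parquet footer, and the compatibility rules are deliberately simplified to the three cases named in the rationale.

```python
from dataclasses import dataclass
from typing import Dict, Optional


class SchemaMismatchError(Exception):
    """Hypothetical stand-in for Spark's SchemaColumnConvertNotSupportedException."""


@dataclass(frozen=True)
class ColumnType:
    # Simplified descriptor; real Parquet and Spark types carry far more detail.
    name: str                       # e.g. "binary", "timestamp", "decimal"
    is_array: bool = False
    precision: Optional[int] = None  # only meaningful for "decimal"
    scale: Optional[int] = None      # only meaningful for "decimal"


Schema = Dict[str, ColumnType]


def check_column(path: str, column: str, file_type: ColumnType, read_type: ColumnType) -> None:
    """Raise if data stored as file_type cannot be read as read_type."""
    if file_type.is_array != read_type.is_array:
        raise SchemaMismatchError(f"{path}: column {column}: scalar/array mismatch")
    if file_type.name != read_type.name:
        raise SchemaMismatchError(
            f"{path}: column {column}: cannot read {file_type.name} as {read_type.name}"
        )
    if file_type.name == "decimal":
        # Assumed rule for illustration: the scale must match and the read
        # precision must be at least as large as the file's precision.
        if file_type.scale != read_type.scale or (
            (read_type.precision or 0) < (file_type.precision or 0)
        ):
            raise SchemaMismatchError(f"{path}: column {column}: decimal precision mismatch")


def validate_files(file_schemas: Dict[str, Schema], read_schema: Schema) -> None:
    """Check every file's schema against the read schema before scanning."""
    for path, file_schema in file_schemas.items():
        for column, read_type in read_schema.items():
            file_type = file_schema.get(column)
            if file_type is None:
                # Columns absent from a file are skipped in this sketch.
                continue
            check_column(path, column, file_type, read_type)
```

Calling `validate_files({"part-0.parquet": {"ts": ColumnType("binary")}}, {"ts": ColumnType("timestamp")})` raises `SchemaMismatchError`, which mirrors the "binary read as timestamp" case. The point of validating per file is that a single incompatible file in a multi-file scan is reported before any rows are produced.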

How are these changes tested?

All of the tests that the issue expects to pass now pass.

@vaibhawvipul changed the title from "fix 3720" to "fix: Add schema validation for native_datafusion Parquet scan" on Apr 4, 2026
Development

Successfully merging this pull request may close these issues.

native_datafusion: no error thrown for schema mismatch when reading Parquet with incompatible types