
Fix/sparse schema and allow column drops#15

Open
hentzthename wants to merge 2 commits into sidequery:main from hentzthename:fix/sparse-schema-and-allow-column-drops

Conversation

@hentzthename
Contributor

Hi Nico -- side note: thanks for your work on the _dlt_loads table. I was looking for it, and all I had to do was install your latest release 😄

The 'sparse data' problem may sound similar to my previous PR: #10

But that PR addressed the scenario where dlt didn't know about the existing wide schema in a fresh container. Once the schema is known (either from _dlt_version or derivation), I discovered these downstream problems:

Problems

When ingesting sparse data (subsequent runs with fewer columns than the established schema), three problems surface:

  1. False SchemaEvolutionError -- validate_schema_changes raises when columns are "dropped" (present in table but absent in incoming data), even though the incoming data is not requesting a schema change. The columns should remain in the Iceberg table and new rows should receive nulls.

    SchemaEvolutionError: Schema evolution validation failed:
      - Columns dropped (not safe): col1, col2, ...
    
  2. cast_table_safe crashes on missing columns -- Even if the error above were bypassed, cast_table_safe calls table.select(target_field_names) which raises a KeyError for columns in the target Iceberg schema that don't exist in the source Arrow table. validate_cast already documents the correct intent -- "Field X exists in target but not in source (will be null)" -- but the cast logic never followed through.

  3. allow_column_drops=True was a no-op -- dropped_fields was computed and validated but never passed to apply_schema_evolution, so columns were never actually removed from the Iceberg schema regardless of the flag value.


Solution

  • Sparse data (allow_column_drops=False, default): the Iceberg table schema stays unchanged and new rows receive null for columns they don't contain.
  • Column drops (allow_column_drops=True): columns missing from incoming data are removed from the Iceberg schema via update.delete_column().
  • No changes to destination.py -- allow_column_drops=False remains the correct default.
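The decision logic above can be sketched as a small pure function (function and variable names here are illustrative, not the PR's actual code):

```python
def plan_for_missing_columns(target_names, incoming_names, allow_column_drops):
    """Decide what to do with columns present in the Iceberg schema
    but absent from the incoming data (illustrative sketch)."""
    dropped = [n for n in target_names if n not in incoming_names]
    if not dropped:
        return ("no-op", [])
    if allow_column_drops:
        # allow_column_drops=True: remove via update.delete_column()
        return ("delete", dropped)
    # Default sparse-data path: keep the columns, fill nulls at write time.
    return ("keep-with-nulls", dropped)

print(plan_for_missing_columns(["id", "col1"], ["id"], False))
# ('keep-with-nulls', ['col1'])
```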

Changes

schema_evolution.py

| Function | Change |
| --- | --- |
| `validate_schema_changes` | Removed the `SchemaEvolutionError` for dropped columns. Neither case warrants an error -- `allow_column_drops=True` removes columns via `apply_schema_evolution`, and `allow_column_drops=False` leaves them in the schema with nulls filled at write time. |
| `apply_schema_evolution` | Added a `dropped_fields` parameter. When provided, calls `update.delete_column()` for each field -- the actual implementation of `allow_column_drops=True` that was previously missing. |
| `evolve_schema_if_needed` | Logs sparse columns as a warning when `allow_column_drops=False`. Passes `dropped_fields` to `apply_schema_evolution` only when `allow_column_drops=True`. Returns early without schema changes for the sparse case. |
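For the allow_column_drops=True path, PyIceberg exposes `delete_column` on its schema-update transaction (normally entered via `table.update_schema()`). A stand-in recorder sketches the call pattern; the class and field names are made up for illustration:

```python
class FakeSchemaUpdate:
    """Stand-in for PyIceberg's UpdateSchema transaction, recording deletes."""
    def __init__(self):
        self.deleted = []
    def __enter__(self):
        return self
    def __exit__(self, *exc):
        return False  # a real transaction would commit on exit
    def delete_column(self, name):
        self.deleted.append(name)

dropped_fields = ["col1", "col2"]  # illustrative field names
with FakeSchemaUpdate() as update:
    for name in dropped_fields:
        update.delete_column(name)
print(update.deleted)  # ['col1', 'col2']
```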

schema_casting.py

| Function | Change |
| --- | --- |
| `cast_table_safe` | Before `table.select(target_field_names)`, adds a null column (`pa.nulls`) for any field in the target schema missing from the source table. Completes what `validate_cast` already documents as the intended behavior. |

Tests

  • New test_sparse_schema.py covering all three problems
  • Existing test_schema_evolution.py updated to reflect corrected behavior

Tests cover:
- validate_schema_changes incorrectly raises on sparse data
- cast_table_safe crashes when source is missing target columns
- apply_schema_evolution never deletes columns for allow_column_drops=True
schema_evolution.py:
- Remove SchemaEvolutionError for dropped columns. allow_column_drops=False
  leaves columns in schema with nulls at write time; allow_column_drops=True
  removes them via apply_schema_evolution.
- Add dropped_fields parameter to apply_schema_evolution, calling
  update.delete_column() for each field.
- evolve_schema_if_needed logs sparse columns as warning when
  allow_column_drops=False and returns early without schema changes.
  Passes dropped_fields to apply_schema_evolution only when
  allow_column_drops=True.

schema_casting.py:
- cast_table_safe adds null columns (pa.nulls) for any field in the target
  schema missing from the source table before table.select(), completing
  the behavior validate_cast already documents.
