
Fix/sparse schema and allow column drops#15

Open
hentzthename wants to merge 2 commits into sidequery:main from hentzthename:fix/sparse-schema-and-allow-column-drops

Conversation

@hentzthename
Contributor

Hi Nico -- side note: thanks for your work on the _dlt_loads table. I was looking for it, and all I had to do was install your latest release 😄

The 'sparse data' problem may sound similar to my previous PR: #10

But that PR addressed the scenario where dlt didn't know about the existing wide schema in a fresh container. Once the schema is known (either from _dlt_version or derivation), I discovered these downstream problems:

Problems

When ingesting sparse data (subsequent runs with fewer columns than the established schema), three problems surface:

  1. False SchemaEvolutionError -- validate_schema_changes raises when columns are "dropped" (present in table but absent in incoming data), even though the incoming data is not requesting a schema change. The columns should remain in the Iceberg table and new rows should receive nulls.

    SchemaEvolutionError: Schema evolution validation failed:
      - Columns dropped (not safe): col1, col2, ...
    
  2. cast_table_safe crashes on missing columns -- Even if the error above were bypassed, cast_table_safe calls table.select(target_field_names) which raises a KeyError for columns in the target Iceberg schema that don't exist in the source Arrow table. validate_cast already documents the correct intent -- "Field X exists in target but not in source (will be null)" -- but the cast logic never followed through.

  3. allow_column_drops=True was a no-op -- dropped_fields was computed and validated but never passed to apply_schema_evolution, so columns were never actually removed from the Iceberg schema regardless of the flag value.


Solution

  • Sparse data (allow_column_drops=False, default): the Iceberg table schema stays unchanged and new rows receive null for columns they don't contain.
  • Column drops (allow_column_drops=True): columns missing from incoming data are removed from the Iceberg schema via update.delete_column().
  • No changes to destination.py -- allow_column_drops=False remains the correct default.
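The decision logic above can be sketched as a small pure function (function and variable names here are illustrative, not the PR's actual code):

```python
def plan_for_missing_columns(target_names, incoming_names, allow_column_drops):
    """Decide what to do with columns present in the Iceberg schema
    but absent from the incoming data (illustrative sketch)."""
    dropped = [n for n in target_names if n not in incoming_names]
    if not dropped:
        return ("no-op", [])
    if allow_column_drops:
        # allow_column_drops=True: remove via update.delete_column()
        return ("delete", dropped)
    # Default sparse-data path: keep the columns, fill nulls at write time.
    return ("keep-with-nulls", dropped)

print(plan_for_missing_columns(["id", "col1"], ["id"], False))
# ('keep-with-nulls', ['col1'])
```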

Changes

schema_evolution.py

| Function | Change |
| --- | --- |
| `validate_schema_changes` | Removed the `SchemaEvolutionError` for dropped columns. Neither case warrants an error -- `allow_column_drops=True` removes columns via `apply_schema_evolution`, and `allow_column_drops=False` leaves them in the schema with nulls filled at write time. |
| `apply_schema_evolution` | Added a `dropped_fields` parameter. When provided, calls `update.delete_column()` for each field -- the actual implementation of `allow_column_drops=True` that was previously missing. |
| `evolve_schema_if_needed` | Logs sparse columns as a warning when `allow_column_drops=False`. Passes `dropped_fields` to `apply_schema_evolution` only when `allow_column_drops=True`. Returns early without schema changes for the sparse case. |
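For the allow_column_drops=True path, PyIceberg exposes `delete_column` on its schema-update transaction (normally entered via `table.update_schema()`). A stand-in recorder sketches the call pattern; the class and field names are made up for illustration:

```python
class FakeSchemaUpdate:
    """Stand-in for PyIceberg's UpdateSchema transaction, recording deletes."""
    def __init__(self):
        self.deleted = []
    def __enter__(self):
        return self
    def __exit__(self, *exc):
        return False  # a real transaction would commit on exit
    def delete_column(self, name):
        self.deleted.append(name)

dropped_fields = ["col1", "col2"]  # illustrative field names
with FakeSchemaUpdate() as update:
    for name in dropped_fields:
        update.delete_column(name)
print(update.deleted)  # ['col1', 'col2']
```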

schema_casting.py

| Function | Change |
| --- | --- |
| `cast_table_safe` | Before `table.select(target_field_names)`, adds a null column (`pa.nulls`) for any field in the target schema missing from the source table. Completes what `validate_cast` already documents as the intended behavior. |

Tests

  • New test_sparse_schema.py covering all three problems
  • Existing test_schema_evolution.py updated to reflect corrected behavior

Tests cover:
- validate_schema_changes incorrectly raises on sparse data
- cast_table_safe crashes when source is missing target columns
- apply_schema_evolution never deletes columns for allow_column_drops=True
schema_evolution.py:
- Remove SchemaEvolutionError for dropped columns. allow_column_drops=False
  leaves columns in schema with nulls at write time; allow_column_drops=True
  removes them via apply_schema_evolution.
- Add dropped_fields parameter to apply_schema_evolution, calling
  update.delete_column() for each field.
- evolve_schema_if_needed logs sparse columns as warning when
  allow_column_drops=False and returns early without schema changes.
  Passes dropped_fields to apply_schema_evolution only when
  allow_column_drops=True.

schema_casting.py:
- cast_table_safe adds null columns (pa.nulls) for any field in the target
  schema missing from the source table before table.select(), completing
  the behavior validate_cast already documents.
