Multischema datasets #3770

Merged
rudolfix merged 27 commits into devel from feat/3746-multischema-datasets
Apr 12, 2026

@burnash (Collaborator) commented Mar 23, 2026

Closes #3746

@rudolfix added the breaking label (This issue introduces breaking change) Mar 23, 2026
@rudolfix (Collaborator) left a comment

pls take a look at the original issue: we want to unify the SQLGlot schema, not dlt schemas. there are many reasons for that, one of them being the ability to support foreign schemas with the same code. ping me for details

Comment thread dlt/dataset/dataset.py Outdated
Comment thread dlt/dataset/lineage.py Outdated
Comment thread dlt/pipeline/pipeline.py Outdated
Comment thread dlt/dataset/dataset.py Outdated
Comment thread dlt/dataset/dataset.py Outdated
Comment thread dlt/dataset/dataset.py Outdated
Comment thread dlt/dataset/dataset.py
union_all_expr: Optional[sge.Query] = None

for table_name in selected_tables:
    counts_expr = build_row_counts_expr(
Collaborator

note: this reimplements the filtering by _dlt_load_id that is already available on the relation

Comment thread dlt/dataset/dataset.py Outdated
Comment thread dlt/dataset/dataset.py Outdated
@@ -505,12 +591,12 @@
def _get_latest_load_id(dataset: dlt.Dataset) -> Optional[str]:
Collaborator

same here

Comment thread dlt/pipeline/pipeline.py Outdated
Comment thread dlt/pipeline/pipeline.py
@rudolfix (Collaborator) left a comment

looks pretty good! but the code could be simpler. we could also use unify_schemas when generating the sqlglot schema

Comment thread dlt/dataset/dataset.py Outdated
Comment thread dlt/dataset/dataset.py Outdated
Comment thread dlt/dataset/dataset.py Outdated
Comment thread dlt/dataset/dataset.py
Comment thread dlt/dataset/lineage.py
Comment thread dlt/dataset/lineage.py Outdated
@burnash force-pushed the feat/3746-multischema-datasets branch from 1bb04a2 to 27ad7b5 on April 1, 2026 11:59
burnash added 2 commits April 2, 2026 15:00
…stination.

row_counts() collects tables from all schemas, but WithTableScanners only resolved tables against the default schema causing table not found errors
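
As an illustration of the failure mode described in this commit message, the sketch below contrasts resolving a table against only the default schema with resolving it against every schema in the dataset. `SCHEMAS`, `resolve_in_default`, and `resolve_in_all` are hypothetical names built on plain dictionaries; this is not the dlt `WithTableScanners` implementation.

```python
from typing import Dict, List, Optional

# Hypothetical layout: schema name -> table names defined in that schema.
SCHEMAS: Dict[str, List[str]] = {
    "default_schema": ["users", "orders"],
    "second_schema": ["events"],
}


def resolve_in_default(table_name: str, default_schema: str = "default_schema") -> Optional[str]:
    """Looks only at the default schema, so tables from other schemas are not found."""
    return default_schema if table_name in SCHEMAS[default_schema] else None


def resolve_in_all(table_name: str) -> Optional[str]:
    """Looks at every schema and returns the first one that defines the table."""
    for schema_name, tables in SCHEMAS.items():
        if table_name in tables:
            return schema_name
    return None
```

With this layout, a table that lives only in the second schema is missed by the default-only lookup and found by the lookup over all schemas.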
@rudolfix (Collaborator) left a comment

good find with WithTableScanners supporting just one schema! I suggest some improvements with instance check. I just looked at the tests; these cases are missing:

  1. a test with schemas that have overlapping tables which are not identical (_dlt_tables are) but have non-conflicting columns (different names, or same names with the same types)
  2. a test that shows a column conflict that actually prevents schema unification (i.e. same column name, different data type; that should fail)
  3. a test for schemas with different naming conventions, i.e. case-sensitive and case-insensitive sql naming conventions. I'd like unify_schema to work with different naming conventions. for now the test should expect unify_schema to fail
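
The first two cases above can be sketched with a small, self-contained example. `merge_table_columns` and `ColumnConflictError` are hypothetical names used only for illustration and do not reflect how dlt implements schema unification; columns are modelled as a plain mapping from column name to data type.

```python
from typing import Dict

# Hypothetical model: column name -> data type for one table.
TableColumns = Dict[str, str]


class ColumnConflictError(ValueError):
    """Raised when the same column name is declared with different data types."""


def merge_table_columns(left: TableColumns, right: TableColumns, table_name: str) -> TableColumns:
    """Merges two column mappings of the same table, failing on type conflicts."""
    merged = dict(left)
    for name, data_type in right.items():
        existing = merged.get(name)
        if existing is not None and existing != data_type:
            raise ColumnConflictError(
                f"column {name!r} in table {table_name!r} has types {existing!r} and {data_type!r}"
            )
        # New columns are added; identical declarations are left unchanged.
        merged[name] = data_type
    return merged
```

Under these assumptions, overlapping tables with different column names or matching types merge cleanly, while the same column name with a different data type raises an error, which is the behavior the requested tests would pin down.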

Comment thread dlt/destinations/sql_client.py Outdated
Comment thread dlt/destinations/impl/duckdb/sql_client.py Outdated
"""Tells if a view for a table `table_schema` can be created"""
pass

def set_schemas(self, schemas: Sequence[Schema]) -> None:
Collaborator

good find, I forgot about table scanners. I wanted to make it a mixin class (so it does not derive from duckdb); then adding set_schemas to it would be trivial

Collaborator Author

Could you clarify whether the mixin refactor of WithTableScanners should be part of this PR or a separate one? The coupling to duckdb is deep now.

Collaborator

not this PR, this is old tech debt not to be fixed now

Comment thread dlt/dataset/dataset.py Outdated
@burnash burnash requested a review from rudolfix April 8, 2026 19:46
@rudolfix (Collaborator) left a comment

there was still a problem with the filesystem sql_client that I discovered by running the test with conflicting user tables on it:

  • we need to UNION views that point to separate data locations
  • we need to merge columns in views on the same data location

this is now done. create_view just returns sql, which allows manipulating the SQL to handle the cases above.
I promoted a few tests so they run on all destinations incl. lance and delta/iceberg. this surfaced an ibis() creation problem (schemas were not passed there), which I fixed.

I think the ticket is complete. docs still remain:
docs/website/docs/general-usage/dataset-access/dataset.md
should explain how we deal with multiple schemas in a dataset. it could be a NOTE admonition: we do not recommend having such datasets.

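
The idea of combining views that point to separate data locations can be sketched in plain Python. This is only an illustration under stated assumptions: `union_views_sql` is a hypothetical helper, not part of dlt, identifiers are not quoted or escaped, and columns missing from one view are padded with `NULL` so every branch of the `UNION ALL` has the same shape.

```python
from typing import Dict, List, Sequence


def union_views_sql(view_columns: Dict[str, Sequence[str]]) -> str:
    """Builds a UNION ALL over views whose column sets may differ."""
    # Collect every column name once, preserving first-seen order.
    all_columns: List[str] = []
    for columns in view_columns.values():
        for col in columns:
            if col not in all_columns:
                all_columns.append(col)

    selects: List[str] = []
    for view_name, columns in view_columns.items():
        # Columns a view does not have are projected as NULL.
        projected = [col if col in columns else f"NULL AS {col}" for col in all_columns]
        selects.append(f"SELECT {', '.join(projected)} FROM {view_name}")
    return "\nUNION ALL\n".join(selects)
```

Returning SQL text rather than executing it mirrors the point made in the review: once view creation yields a string, it can be combined or rewritten before it reaches the database.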
rudolfix previously approved these changes Apr 12, 2026
@rudolfix (Collaborator) left a comment

LGTM!

@rudolfix (Collaborator) left a comment

LGTM!

@rudolfix rudolfix merged commit 6c91afc into devel Apr 12, 2026
75 of 76 checks passed
@rudolfix rudolfix deleted the feat/3746-multischema-datasets branch April 12, 2026 19:22
Labels

breaking This issue introduces breaking change

Development

Successfully merging this pull request may close these issues.

(feat) multischema datasets with local and foreign schemas

2 participants