Conversation

@ClaraBuettner (Contributor)

Deals with #1283

I implemented a first draft that adds the attributes sources and targets to all Datasets.
The attributes are added to every Dataset, but they remain empty as long as they are not filled during initialization.
I also added some functions to DatasetSources and DatasetTargets, e.g. to easily get the schema of a specific table.
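
For illustration, here is a minimal sketch of how such a Dataset could declare its sources and targets. The class names, helper methods and table names are assumptions for the example, not the actual implementation in this PR:

```python
# Hypothetical sketch only -- names and structure are assumptions,
# not the implementation in this PR.
from dataclasses import dataclass


@dataclass
class TableRef:
    schema: str
    table: str


class DatasetSources:
    def __init__(self, tables):
        self.tables = list(tables)

    def schema(self, table_name):
        """Return the schema of a specific table, e.g. for metadata creation."""
        for ref in self.tables:
            if ref.table == table_name:
                return ref.schema
        raise KeyError(table_name)


class DatasetTargets(DatasetSources):
    """Same helpers, but for the tables a dataset writes."""


class ExampleDataset:
    """Stand-in for a Dataset subclass such as HeatSupply."""

    def __init__(self):
        self.sources = DatasetSources([TableRef("schema_a", "input_table")])
        self.targets = DatasetTargets([TableRef("schema_b", "output_table")])
```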

The sources and targets are written into new columns of the table metadata.datasets. Submodules of the datasets can access the attributes from there, which avoids importing the datasets class into the submodules (something that always leads to a lot of problems with circular imports). In addition, this gives a quick overview of the sources and targets of all datasets.
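
A submodule could then look the information up directly in the database, roughly like this (the column names, the lookup by dataset name and the connection string are placeholders, not the actual schema):

```python
# Placeholder sketch: read a dataset's sources/targets from metadata.datasets
# instead of importing its Dataset class (which avoids circular imports).
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@localhost/egon")  # placeholder DSN

with engine.connect() as con:
    row = con.execute(
        text("SELECT sources, targets FROM metadata.datasets WHERE name = :name"),
        {"name": "HeatSupply"},
    ).fetchone()

sources, targets = row  # e.g. JSON listings of the dataset's input and output tables
```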

To be able to access the entries of metadata.datasets before all tasks of a dataset have been executed, each dataset is first registered in the table, i.e. the information on its sources and targets is added right away. The version is still added after the execution of all tasks, in order to keep the versioning feature working.
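
As a toy illustration of this two-step flow (an in-memory stand-in, not the actual code):

```python
# Toy illustration of the register-first, version-later flow.
import json

registry = {}  # stand-in for the metadata.datasets table


def register_dataset(name, sources, targets):
    # Step 1: make sources and targets visible before any task runs.
    registry[name] = {
        "sources": json.dumps(sources),
        "targets": json.dumps(targets),
        "version": None,
    }


def finalize_dataset(name, version):
    # Step 2: write the version only after all tasks have finished,
    # so the existing versioning feature keeps working.
    registry[name]["version"] = version


register_dataset("ExampleDataset", ["schema_a.input_table"], ["schema_b.output_table"])
# ... the dataset's tasks run here ...
finalize_dataset("ExampleDataset", "0.0.1")
```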

I tried the new structure for the datasets ZensusPopulation, ZensusMiscellaneous and HeatSupply.
We will extend the usage in the future; those three are just meant to be examples.

If you have any questions or comments, feel free to ask. I'm happy about any feedback!

@ClaraBuettner requested a review from nesnoj on July 9, 2025
@nesnoj added the 🚀 feature (New feature or feature request) and 🔄 workflow (It's about the workflow (airflow)) labels on Jul 9, 2025
@nesnoj (Member) commented Jul 9, 2025

Thx a lot for this proposal @ClaraBuettner!

@jh-RLI This PR is a first draft for complementing the current dataset-class-/task-/process-based dependency graph with a data(set)-based one. It is a requirement for automatic metadata creation (#1298).
Feel free to add your feedback too.

@jh-RLI commented Jul 9, 2025

Looks very nice!
If I understand correctly, you will then have to list all sources and targets once in each Dataset class. This is possible now that you switched the approach, so that during the tasks/processes in the pipeline only the listed datasets are used.

For me and my automation tasks regarding metadata and OEP upload, this would mean I can import a class like HeatSupply and get all source and target table resources, which I can then use to access the table structure from the DB or pandas to generate parts of the metadata, and later also to read the data for chunked upload to the OEP.
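
For illustration, that workflow could look roughly like this (the import path, the sources/targets attributes and the connection string are assumptions on my side, not the actual API):

```python
# Hypothetical sketch of the intended workflow -- import path, attribute
# names and DSN are assumptions, not the actual API.
import pandas as pd
from sqlalchemy import create_engine

from egon.data.datasets.heat_supply import HeatSupply  # assumed import path

engine = create_engine("postgresql://user:password@localhost/egon")  # placeholder DSN

dataset = HeatSupply()  # assumes .targets holds (schema, table) references

for ref in dataset.targets.tables:
    full_name = f"{ref.schema}.{ref.table}"

    # Inspect the table structure to derive parts of the metadata.
    header = pd.read_sql(f"SELECT * FROM {full_name} LIMIT 0", engine)
    print(full_name, list(header.columns))

    # Later: read the data in chunks for upload to the OEP.
    for chunk in pd.read_sql(f"SELECT * FROM {full_name}", engine, chunksize=10_000):
        pass  # push each chunk to the OEP API
```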

As I see it now, could the sources and targets be two datasets as input/output (n table resources), each defined as a dataset class (a child like HeatSupply)? Or is a dataset always one thing that should list both its source and target resources?

@ClaraBuettner (Contributor, Author)

> As I see it now, could the sources and targets be two datasets as input/output (n table resources), each defined as a dataset class (a child like HeatSupply)? Or is a dataset always one thing that should list both its source and target resources?

Every dataset will get sources and targets, which are both n tables (and files). Many datasets fill more than one table, and some tables are also filled by multiple datasets. But I'm sorry, I am not completely sure I got your question right. Does that answer help you?

@jh-RLI commented Jul 15, 2025

> As I see it now, could the sources and targets be two datasets as input/output (n table resources), each defined as a dataset class (a child like HeatSupply)? Or is a dataset always one thing that should list both its source and target resources?
>
> Every dataset will get sources and targets, which are both n tables (and files). Many datasets fill more than one table, and some tables are also filled by multiple datasets. But I'm sorry, I am not completely sure I got your question right. Does that answer help you?

Sorry, I'm still getting started with the project ... some questions might not be fully reasonable :) I'm trying to get my mental model right.

I was comparing source and target to input and output datasets, like something that would be used in processes like input -> processing -> output.
Then I was thinking about data publishing, i.e. what ends up on the OEP. I saw two options:

  1. One dataset per process. This would then be a combination of source and process in one datapackage.
  2. Or is it reasonable to split each process into two datasets, one for the source and one for the target?

Now I think it is more like this: all dataset classes represent n tables for all sources or targets, and together the dataset classes collect the dependency information of the complete data bundle. I think this is what I want to collect, describe with metadata and publish on the OEP.

Hope this makes it a bit clearer. Your comment already helped.
