Conversation

@ClaraBuettner (Contributor)

Deals with #1283

I implemented a first draft that adds the attributes sources and targets to all Datasets.
The attributes are added to every Dataset, but they remain empty as long as they are not filled during initialization.
I also added some functions to DatasetSources and DatasetTargets, e.g. to easily get the schema of a specific table.
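
For illustration, here is a minimal sketch of how such a Dataset could declare its sources and targets. The class names, helper methods and table names are assumptions for the example, not the actual implementation in this PR:

```python
# Hypothetical sketch only -- names and structure are assumptions,
# not the implementation in this PR.
from dataclasses import dataclass


@dataclass
class TableRef:
    schema: str
    table: str


class DatasetSources:
    def __init__(self, tables):
        self.tables = list(tables)

    def schema(self, table_name):
        """Return the schema of a specific table, e.g. for metadata creation."""
        for ref in self.tables:
            if ref.table == table_name:
                return ref.schema
        raise KeyError(table_name)


class DatasetTargets(DatasetSources):
    """Same helpers, but for the tables a dataset writes."""


class ExampleDataset:
    """Stand-in for a Dataset subclass such as HeatSupply."""

    def __init__(self):
        self.sources = DatasetSources([TableRef("schema_a", "input_table")])
        self.targets = DatasetTargets([TableRef("schema_b", "output_table")])
```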

The sources and targets are written into new columns of the table metadata.datasets. Submodules of the datasets can access the attributes from there, which avoids importing the datasets class into the submodules (something that always leads to a lot of problems with circular imports). In addition, this gives a quick overview of the sources and targets of all datasets.
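
A submodule could then look the information up directly in the database, roughly like this (the column names, the lookup by dataset name and the connection string are placeholders, not the actual schema):

```python
# Placeholder sketch: read a dataset's sources/targets from metadata.datasets
# instead of importing its Dataset class (which avoids circular imports).
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@localhost/egon")  # placeholder DSN

with engine.connect() as con:
    row = con.execute(
        text("SELECT sources, targets FROM metadata.datasets WHERE name = :name"),
        {"name": "HeatSupply"},
    ).fetchone()

sources, targets = row  # e.g. JSON listings of the dataset's input and output tables
```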

To be able to access the entries of metadata.datasets before all tasks of a dataset have been executed, each dataset is first registered in the table, i.e. the information on its sources and targets is added right away. The version is still added after the execution of all tasks, in order to keep the versioning feature working.
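
As a toy illustration of this two-step flow (an in-memory stand-in, not the actual code):

```python
# Toy illustration of the register-first, version-later flow.
import json

registry = {}  # stand-in for the metadata.datasets table


def register_dataset(name, sources, targets):
    # Step 1: make sources and targets visible before any task runs.
    registry[name] = {
        "sources": json.dumps(sources),
        "targets": json.dumps(targets),
        "version": None,
    }


def finalize_dataset(name, version):
    # Step 2: write the version only after all tasks have finished,
    # so the existing versioning feature keeps working.
    registry[name]["version"] = version


register_dataset("ExampleDataset", ["schema_a.input_table"], ["schema_b.output_table"])
# ... the dataset's tasks run here ...
finalize_dataset("ExampleDataset", "0.0.1")
```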

I tried the new structure for the datasets ZensusPopulation, ZensusMiscellaneous and HeatSupply.
We will extend the usage in the future; those three are just meant to be examples.

If you have any questions or comments, feel free to ask. I'm happy about any feedback!

@ClaraBuettner requested a review from nesnoj on July 9, 2025
@nesnoj added the 🚀 feature (New feature or feature request) and 🔄 workflow (It's about the workflow (airflow)) labels on Jul 9, 2025
@nesnoj (Member) commented Jul 9, 2025

Thx a lot for this proposal @ClaraBuettner!

@jh-RLI This PR is a first draft for complementing the current dataset-class-/task-/process-based dependency graph with a data(set)-based one. It is a requirement for automatic metadata creation (#1298).
Feel free to add your feedback too.

@jh-RLI commented Jul 9, 2025

Looks very nice!
If I understand correctly, you will then have to list all sources and targets once in each Dataset class. This is possible now that you switched the approach, so that during the tasks/processes in the pipeline only the listed datasets are used.

For me and my automation tasks regarding metadata and OEP upload, this would mean I can import a class like HeatSupply and get all source and target table resources, which I can then use to access the table structure from the DB or pandas to generate parts of the metadata, and later also to read the data for chunked upload to the OEP.
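
For illustration, that workflow could look roughly like this (the import path, the sources/targets attributes and the connection string are assumptions on my side, not the actual API):

```python
# Hypothetical sketch of the intended workflow -- import path, attribute
# names and DSN are assumptions, not the actual API.
import pandas as pd
from sqlalchemy import create_engine

from egon.data.datasets.heat_supply import HeatSupply  # assumed import path

engine = create_engine("postgresql://user:password@localhost/egon")  # placeholder DSN

dataset = HeatSupply()  # assumes .targets holds (schema, table) references

for ref in dataset.targets.tables:
    full_name = f"{ref.schema}.{ref.table}"

    # Inspect the table structure to derive parts of the metadata.
    header = pd.read_sql(f"SELECT * FROM {full_name} LIMIT 0", engine)
    print(full_name, list(header.columns))

    # Later: read the data in chunks for upload to the OEP.
    for chunk in pd.read_sql(f"SELECT * FROM {full_name}", engine, chunksize=10_000):
        pass  # push each chunk to the OEP API
```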

As I see it now, could the sources and targets be two datasets as input/output (n table resources), each defined as a dataset class (a child like HeatSupply)? Or is a dataset always one thing that should list both its source and target resources?

@ClaraBuettner (Contributor, Author)

> As I see it now, could the sources and targets be two datasets as input/output (n table resources), each defined as a dataset class (a child like HeatSupply)? Or is a dataset always one thing that should list both its source and target resources?

Every dataset will get sources and targets, which are both n tables (and files). Many datasets fill more than one table, and some tables are also filled by multiple datasets. But I'm sorry, I am not completely sure I got your question right. Does that answer help you?

@jh-RLI commented Jul 15, 2025

> As I see it now, could the sources and targets be two datasets as input/output (n table resources), each defined as a dataset class (a child like HeatSupply)? Or is a dataset always one thing that should list both its source and target resources?
>
> Every dataset will get sources and targets, which are both n tables (and files). Many datasets fill more than one table, and some tables are also filled by multiple datasets. But I'm sorry, I am not completely sure I got your question right. Does that answer help you?

Sorry, I'm still getting started with the project ... some questions might not be fully reasonable :) I'm trying to get my mental model right.

I was comparing source and target to input and output datasets, like something that would be used in processes like input -> processing -> output.
Then I was thinking about data publishing, i.e. what ends up on the OEP. I saw two options:

  1. One dataset per process. This would then be a combination of source and process in one datapackage.
  2. Or is it reasonable to split each process into two datasets, one for the source and one for the target?

Now I think it is more like this: all dataset classes represent n tables for all sources or targets, and together the dataset classes collect the dependency information of the complete data bundle. I think this is what I want to collect, describe with metadata and publish on the OEP.

Hope this makes it a bit clearer. Your comment already helped.
