Add structure to unify setting sources and targets of Datasets #1304
base: dev
…executed The version is set after the execution of all tasks
|
Thx a lot for this proposal @ClaraBuettner! @jh-RLI This PR is a first draft for complementing the current dataset-class-/task-/process- based dependency graph by a data(set)-based one. It is a requirement for automatic metadata creation #1298 |
|
Looks very nice! For me and my automation tasks regarding metadata and OEP upload, this would mean I can import a class like As I see it now, the sources and targets could be two datasets as input/output (n table resources) as defined dataset class like child like |
Every dataset will get sources and targets, which are both n tables (and files). Many datasets fill more than one table, and some tables are also filled by multiple datasets. But I'm sorry, I am not completely sure if I got your question right. Does that answer help you? |
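The many-to-many relation described above (one dataset fills several tables, one table is filled by several datasets) can be sketched as a reverse index. This is only an illustration: the dataset names, table names and the helper `tables_to_datasets` are hypothetical and not part of the PR.

```python
from collections import defaultdict

# Hypothetical mapping of dataset name -> target tables (illustrative names only).
example_targets = {
    "DatasetA": ["schema_x.table_1", "schema_x.table_2"],
    "DatasetB": ["schema_x.table_2"],
}


def tables_to_datasets(targets):
    """Invert a dataset -> tables mapping into a table -> datasets mapping."""
    index = defaultdict(list)
    for dataset, tables in targets.items():
        for table in tables:
            index[table].append(dataset)
    return dict(index)


table_index = tables_to_datasets(example_targets)
```

With such an index, a table that is filled by more than one dataset shows up as a key with several entries.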
Sorry, I am still getting started with the project ... some questions might not be fully reasonable :) I'm trying to get my mental model right. I was comparing source and target to an input and output dataset, like something that would be used in processes such as input -> processing -> output.
Now I think it is more like all dataset classes represent n tables for all sources or targets. All dataset classes collect the dependency information of the complete data bundle. I think this is what I want to collect, describe with metadata and publish on the OEP. I hope this makes it a bit clearer. Your comment also helped already. |
Deals with #1283
I implemented a first draft to add the attributes `sources` and `targets` to all `Datasets`.
The attributes are added to every `Dataset`, but remain empty as long as they are not filled in the initialization.
Some functions are also added to `DatasetSources` and `DatasetTargets`, e.g. to easily get the schema of a specific table.
The `sources` and `targets` are written into new columns of the table `metadata.datasets`. Submodules of the datasets can access the attributes from there to avoid importing the dataset class into the submodules (which always leads to a lot of problems with circular imports). In addition, this makes it possible to get a quick overview of the `sources` and `targets` of all datasets.
To be able to access the entries of `metadata.datasets` before all tasks of the dataset are executed, each dataset is first registered in the table, meaning that the information on the sources and targets is added. The version is still added after the execution of all tasks, in order to keep the versioning feature running.
I tried the new structure for the datasets `ZensusPopulation`, `ZensusMiscellaneous` and `HeatSupply`. We will extend the usage in the future; those three are just meant to be examples.
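For readers who want a picture of the idea, here is a minimal, self-contained sketch. It is not the code of this PR: the fields `tables` and `files`, the method `get_table_schema`, the functions `register` and `mark_executed`, the plain dict standing in for `metadata.datasets`, and the table name are all assumptions made for illustration.

```python
from dataclasses import asdict, dataclass, field
from typing import Dict, Optional


@dataclass
class _TableCollection:
    """Hypothetical base: named tables ("schema.table") and files."""

    tables: Dict[str, str] = field(default_factory=dict)
    files: Dict[str, str] = field(default_factory=dict)

    def get_table_schema(self, key: str) -> str:
        # Return the part before the first dot of "schema.table".
        return self.tables[key].split(".", 1)[0]


class DatasetSources(_TableCollection):
    """Tables and files a dataset reads from (sketch only)."""


class DatasetTargets(_TableCollection):
    """Tables and files a dataset writes to (sketch only)."""


@dataclass
class Dataset:
    name: str
    version: str
    # Both attributes exist on every dataset and stay empty unless filled.
    sources: DatasetSources = field(default_factory=DatasetSources)
    targets: DatasetTargets = field(default_factory=DatasetTargets)


def register(registry: Dict[str, dict], dataset: Dataset) -> None:
    """Store sources and targets before any task runs; version stays unset."""
    registry[dataset.name] = {
        "version": None,
        "sources": asdict(dataset.sources),
        "targets": asdict(dataset.targets),
    }


def mark_executed(registry: Dict[str, dict], dataset: Dataset) -> None:
    """Set the version only after all tasks of the dataset have run."""
    registry[dataset.name]["version"] = dataset.version


# Example with an illustrative (assumed) table name.
registry: Dict[str, dict] = {}
zensus = Dataset(
    name="ZensusPopulation",
    version="0.0.1",
    targets=DatasetTargets(tables={"population": "society.example_population"}),
)
register(registry, zensus)
version_before: Optional[str] = registry["ZensusPopulation"]["version"]
mark_executed(registry, zensus)
```

In this sketch a submodule would only need the `registry` entry, not the `Dataset` class itself, which mirrors the motivation of avoiding circular imports.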
If you have any questions or comments, feel free to ask. I'm happy about any feedback!