DBA-243 Extract profiling into a separate step disabled by default#163
DBA-243 Extract profiling into a separate step disabled by default#163annav1asova merged 4 commits intomainfrom
Conversation
src/databao_context_engine/plugins/databases/duckdb/duckdb_introspector.py
Outdated
Show resolved
Hide resolved
| if not schemas_to_introspect: | ||
| continue | ||
|
|
||
| with self._connect(file_config, catalog=catalog) as conn: |
There was a problem hiding this comment.
I'm not sure what kind of perf overhead it has but with this new code, we're connecting to each catalog twice
=> we used to have 1 + n connections and now we have 1+2n connections, with n being the number of catalogs
Does every DB lib implement a connection pool? Do we usually only introspect one catalog? Or do we expect a lot of catalog to be introspected in the same datasource?
There was a problem hiding this comment.
I dont understand, how do we do it twice? It has not changed as far as I can tell
There was a problem hiding this comment.
I changed it to 1+n in the latest commit. Before that resolving scope was a separate step with n connections
src/databao_context_engine/plugins/databases/duckdb/duckdb_introspector.py
Outdated
Show resolved
Hide resolved
| if table_stats: | ||
| table_stats_map = {(e.schema_name, e.table_name): e for e in table_stats} | ||
| for schema in schemas: | ||
| for table in schema.tables: |
There was a problem hiding this comment.
Annoying that we have to iterate over everything to only fill some tables with a stat.
That's one more argument towards not using a list but a dict for the schemas/tables/columns attributes:
class DatabaseSchema:
tables: dict[str, DatabaseTable]
"""table_name to table object"""
| def introspect_database(self, file_config: T) -> DatabaseIntrospectionResult: | ||
| scope = self._resolve_scope(file_config) | ||
| sampling_matcher = SamplingScopeMatcher(file_config.sampling, ignored_schemas=self._ignored_schemas()) | ||
| profiling_enabled = bool(file_config.profiling and file_config.profiling.enabled) |
There was a problem hiding this comment.
Just thinking out loud: should this default behaviour be set by the introspector implementation?
e.g. in my understanding, we get that data basically for free in Postgresql so it might make sense to always include it?
There was a problem hiding this comment.
yes, we can, though it might be a bit confusing that output files contain different data depending on the connection type when no profiling value is specified in the config
There was a problem hiding this comment.
maybe we also need some global profiling flag, because it might be annoying to set it for every datasource
There was a problem hiding this comment.
though it might be a bit confusing that output files contain different data depending on the connection type when no profiling value is specified in the config
Yes, you're right, it's probably better to keep the same default for everything
maybe we also need some global profiling flag, because it might be annoying to set it for every datasource
Yeah... Ideally, for all of our config settings, we should have some global config equivalent as well.
In your example:
- in
dce.ini, you can set theprofiling_enabled = true(or even something per datasource type likeprofiling_enabled: { "postgres": true }) - in each datasource config, you can override the global setting specifically for this datasource
But that's probably not required until we have more users, with more complex use-cases
src/databao_context_engine/plugins/databases/databases_types.py
Outdated
Show resolved
Hide resolved
| columns: list[ColumnRef] = field(default_factory=list) | ||
|
|
||
|
|
||
| @dataclass(frozen=True, slots=True) |
There was a problem hiding this comment.
❓ What's the benefit of slots=True here? I read quickly about it at some point and I thought it was a low-level optimisation
There was a problem hiding this comment.
It doesn't create __dict__ for objects making them a bit more lightweight, as far as I understand it can be helpful when we create lots of small immutable objects with a fixed structure. But it's low-level for sure, so it may not matter much in our case
There was a problem hiding this comment.
Ah ok, so you added as a memory concern.
It doesn't matter too much for this one since we're not serialising it. But our serialisation to YAML actually depends on __dict__ being present to be able to list all public attributes in the object and serialise them.
So I'm just a bit worried that we start copying this header on all of our dataclasses without thinking twice about it and it breaks our serialisation 😛
No description provided.