Skip to content

DBA-243 Extract profiling into a separate step disabled by default#163

Merged
annav1asova merged 4 commits intomainfrom
DBA-243-extract-profiling
Mar 20, 2026
Merged

DBA-243 Extract profiling into a separate step disabled by default#163
annav1asova merged 4 commits intomainfrom
DBA-243-extract-profiling

Conversation

@annav1asova
Copy link
Collaborator

No description provided.

if not schemas_to_introspect:
continue

with self._connect(file_config, catalog=catalog) as conn:
Copy link
Collaborator

@JulienArzul JulienArzul Mar 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what kind of perf overhead it has but with this new code, we're connecting to each catalog twice
=> we used to have 1 + n connections and now we have 1+2n connections, with n being the number of catalogs

Does every DB lib implement a connection pool? Do we usually only introspect one catalog? Or do we expect a lot of catalog to be introspected in the same datasource?

Copy link
Collaborator

@hsestupin hsestupin Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont understand, how do we do it twice? It has not changed as far as I can tell

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it to 1+n in the latest commit. Before that resolving scope was a separate step with n connections

if table_stats:
table_stats_map = {(e.schema_name, e.table_name): e for e in table_stats}
for schema in schemas:
for table in schema.tables:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Annoying that we have to iterate over everything to only fill some tables with a stat.

That's one more argument towards not using a list but a dict for the schemas/tables/columns attributes:

class DatabaseSchema:
    tables: dict[str, DatabaseTable]
    """table_name to table object"""

def introspect_database(self, file_config: T) -> DatabaseIntrospectionResult:
scope = self._resolve_scope(file_config)
sampling_matcher = SamplingScopeMatcher(file_config.sampling, ignored_schemas=self._ignored_schemas())
profiling_enabled = bool(file_config.profiling and file_config.profiling.enabled)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just thinking out loud: should this default behaviour be set by the introspector implementation?

e.g. in my understanding, we get that data basically for free in Postgresql so it might make sense to always include it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, we can, though it might be a bit confusing that output files contain different data depending on the connection type when no profiling value is specified in the config

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we also need some global profiling flag, because it might be annoying to set it for every datasource

Copy link
Collaborator

@JulienArzul JulienArzul Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

though it might be a bit confusing that output files contain different data depending on the connection type when no profiling value is specified in the config

Yes, you're right, it's probably better to keep the same default for everything

maybe we also need some global profiling flag, because it might be annoying to set it for every datasource

Yeah... Ideally, for all of our config settings, we should have some global config equivalent as well.

In your example:

  • in dce.ini, you can set the profiling_enabled = true (or even something per datasource type like profiling_enabled: { "postgres": true })
  • in each datasource config, you can override the global setting specifically for this datasource

But that's probably not required until we have more users, with more complex use-cases

columns: list[ColumnRef] = field(default_factory=list)


@dataclass(frozen=True, slots=True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❓ What's the benefit of slots=True here? I read quickly about it at some point and I thought it was a low-level optimisation

Copy link
Collaborator Author

@annav1asova annav1asova Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't create __dict__ for objects making them a bit more lightweight, as far as I understand it can be helpful when we create lots of small immutable objects with a fixed structure. But it's low-level for sure, so it may not matter much in our case

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok, so you added as a memory concern.

It doesn't matter too much for this one since we're not serialising it. But our serialisation to YAML actually depends on __dict__ being present to be able to list all public attributes in the object and serialise them.

So I'm just a bit worried that we start copying this header on all of our dataclasses without thinking twice about it and it breaks our serialisation 😛

@annav1asova annav1asova merged commit 99c524e into main Mar 20, 2026
10 checks passed
@annav1asova annav1asova deleted the DBA-243-extract-profiling branch March 20, 2026 12:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants