DBA-243 Extract profiling into a separate step disabled by default by annav1asova · Pull Request #163 · JetBrains/databao-context-engine

annav1asova · 2026-03-17T13:46:58Z

No description provided.

src/databao_context_engine/plugins/databases/duckdb/duckdb_introspector.py

JulienArzul · 2026-03-18T12:52:05Z

src/databao_context_engine/plugins/databases/base_introspector.py

+            if not schemas_to_introspect:
+                continue
+
+            with self._connect(file_config, catalog=catalog) as conn:


I'm not sure what kind of perf overhead it has, but with this new code we're connecting to each catalog twice
=> we used to have 1 + n connections and now we have 1 + 2n connections, with n being the number of catalogs

Does every DB lib implement a connection pool? Do we usually only introspect one catalog? Or do we expect a lot of catalogs to be introspected in the same datasource?

I don't understand: how do we do it twice? It has not changed, as far as I can tell.

I changed it to 1 + n in the latest commit. Before that, resolving the scope was a separate step with n connections.
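
To make the connection counts in this thread concrete, here is a minimal sketch that only counts connections. It is not the code from this PR: `CountingIntrospector`, `_list_catalogs`, `introspect_two_pass` and `introspect_single_pass` are hypothetical names, and `_connect` is a stand-in that opens nothing.

```python
from contextlib import contextmanager


class CountingIntrospector:
    """Hypothetical stand-in for an introspector; it only counts connections."""

    def __init__(self, catalogs):
        # catalogs maps a catalog name to the schemas it contains (assumed shape)
        self.catalogs = catalogs
        self.connections_opened = 0

    @contextmanager
    def _connect(self, catalog=None):
        # A real implementation would open a DB connection here.
        self.connections_opened += 1
        yield f"conn:{catalog}"

    def _list_catalogs(self):
        with self._connect():  # the "1" in 1 + n
            return list(self.catalogs)

    def introspect_two_pass(self):
        """Scope resolution and introspection as separate steps: 1 + 2n."""
        scope = {}
        for catalog in self._list_catalogs():
            with self._connect(catalog=catalog):  # n connections to resolve scope
                scope[catalog] = self.catalogs[catalog]
        for catalog, schemas in scope.items():
            if not schemas:
                continue
            with self._connect(catalog=catalog):  # up to n more to introspect
                pass
        return self.connections_opened

    def introspect_single_pass(self):
        """Scope resolution and introspection on the same connection: 1 + n."""
        for catalog in self._list_catalogs():
            with self._connect(catalog=catalog):
                schemas = self.catalogs[catalog]
                if not schemas:
                    continue
                # introspection of `schemas` would happen here
        return self.connections_opened
```

With three non-empty catalogs, the two-pass variant opens 7 connections and the single-pass variant opens 4.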

src/databao_context_engine/plugins/databases/duckdb/duckdb_introspector.py

src/databao_context_engine/plugins/databases/base_introspector.py

JulienArzul · 2026-03-18T13:02:31Z

src/databao_context_engine/plugins/databases/base_introspector.py

+        if table_stats:
+            table_stats_map = {(e.schema_name, e.table_name): e for e in table_stats}
+            for schema in schemas:
+                for table in schema.tables:


Annoying that we have to iterate over everything to only fill some tables with a stat.

That's one more argument towards not using a list but a dict for the schemas/tables/columns attributes:

class DatabaseSchema:
    tables: dict[str, DatabaseTable]
    """table_name to table object"""
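
A rough sketch of what the dict-based layout could buy: each stats entry is applied with two lookups instead of a walk over every schema and table. The class bodies below are simplified assumptions, not the project's real definitions; `row_count` and `apply_table_stats` are hypothetical, and only `schema_name` and `table_name` come from the diff above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DatabaseTable:
    name: str
    row_count: Optional[int] = None  # hypothetical stat field


@dataclass
class DatabaseSchema:
    name: str
    # table_name to table object
    tables: dict[str, DatabaseTable] = field(default_factory=dict)


@dataclass(frozen=True)
class TableStatsEntry:
    schema_name: str
    table_name: str
    row_count: int  # hypothetical stat field


def apply_table_stats(schemas: dict[str, DatabaseSchema], table_stats) -> None:
    """Attach each stat to its table via direct lookups (hypothetical helper)."""
    for entry in table_stats:
        schema = schemas.get(entry.schema_name)
        table = schema.tables.get(entry.table_name) if schema else None
        if table is not None:  # silently skip stats for tables out of scope
            table.row_count = entry.row_count
```

The cost becomes proportional to the number of stats entries, which matters most when only a few tables have stats.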

JulienArzul · 2026-03-18T13:33:31Z

src/databao_context_engine/plugins/databases/base_introspector.py

    def introspect_database(self, file_config: T) -> DatabaseIntrospectionResult:
+        scope = self._resolve_scope(file_config)
+        sampling_matcher = SamplingScopeMatcher(file_config.sampling, ignored_schemas=self._ignored_schemas())
+        profiling_enabled = bool(file_config.profiling and file_config.profiling.enabled)


Just thinking out loud: should this default behaviour be set by the introspector implementation?

e.g. in my understanding, we get that data basically for free in PostgreSQL, so it might make sense to always include it?

yes, we can, though it might be a bit confusing that output files contain different data depending on the connection type when no profiling value is specified in the config

maybe we also need some global profiling flag, because it might be annoying to set it for every datasource

though it might be a bit confusing that output files contain different data depending on the connection type when no profiling value is specified in the config

Yes, you're right, it's probably better to keep the same default for everything

maybe we also need some global profiling flag, because it might be annoying to set it for every datasource

Yeah... Ideally, for all of our config settings, we should have some global config equivalent as well.

In your example:

in dce.ini, you can set profiling_enabled = true (or even something per datasource type, like profiling_enabled: { "postgres": true })

in each datasource config, you can override the global setting specifically for this datasource

But that's probably not required until we have more users, with more complex use-cases
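
For illustration only, the precedence described above (datasource config overrides the global setting, which may be a single flag or a per-type mapping) could be resolved as below. `resolve_profiling_enabled` and its parameters are hypothetical, and reading the values out of dce.ini is left out.

```python
from collections.abc import Mapping
from typing import Optional, Union

# Assumed shape of the global setting: one flag, or a flag per datasource type.
GlobalProfiling = Union[bool, Mapping, None]


def resolve_profiling_enabled(
    global_setting: GlobalProfiling,
    datasource_type: str,
    datasource_setting: Optional[bool] = None,
    default: bool = False,
) -> bool:
    """Return the effective profiling flag for one datasource (hypothetical helper)."""
    # 1. An explicit value in the datasource config always wins.
    if datasource_setting is not None:
        return datasource_setting
    # 2. Otherwise fall back to the global setting.
    if isinstance(global_setting, bool):
        return global_setting
    if isinstance(global_setting, Mapping):
        return bool(global_setting.get(datasource_type, default))
    # 3. Nothing configured anywhere: same default for every datasource type.
    return default
```

Keeping `default` identical for all datasource types matches the conclusion reached earlier in this thread.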

src/databao_context_engine/plugins/databases/databases_types.py

JulienArzul · 2026-03-19T12:33:13Z

src/databao_context_engine/plugins/databases/databases_types.py

+    columns: list[ColumnRef] = field(default_factory=list)
+
+
+@dataclass(frozen=True, slots=True)


❓ What's the benefit of slots=True here? I read quickly about it at some point and I thought it was a low-level optimisation

It doesn't create __dict__ for objects, which makes them a bit more lightweight. As far as I understand, it can be helpful when we create lots of small immutable objects with a fixed structure. But it's low-level for sure, so it may not matter much in our case.

Ah ok, so you added it as a memory concern.

It doesn't matter too much for this one since we're not serialising it. But our serialisation to YAML actually depends on __dict__ being present to be able to list all public attributes in the object and serialise them.

So I'm just a bit worried that we start copying this header on all of our dataclasses without thinking twice about it and it breaks our serialisation 😛

annav1asova added 2 commits March 17, 2026 14:46

DBA-243 Extract profiling into a separate step disabled by default

f618e30

Use TableStatsEntry and ColumnStatsEntry dataclasses instead of untyped dicts

a127d26

hsestupin reviewed Mar 18, 2026

src/databao_context_engine/plugins/databases/duckdb/duckdb_introspector.py (outdated)

JulienArzul reviewed Mar 18, 2026

Pass lightweight CatalogScope to collect_stats method

06a11d4

JulienArzul reviewed Mar 19, 2026

src/databao_context_engine/plugins/databases/databases_types.py (outdated)

JulienArzul reviewed Mar 19, 2026

Fix SchemaRef name

329f9f7

annav1asova merged commit 99c524e into main Mar 20, 2026
10 checks passed

annav1asova deleted the DBA-243-extract-profiling branch March 20, 2026 12:01


3 participants
