
feat: Plan + Implementation for 392 managed storage improvements #393

Merged
mikeknep merged 15 commits into main from managed-storage-refactor on Mar 19, 2026

Conversation

@mikeknep
Contributor

Initial plan to address #392.

@mikeknep mikeknep requested a review from a team as a code owner March 10, 2026 21:35
@greptile-apps
Contributor

greptile-apps bot commented Mar 10, 2026

Greptile Summary

This PR replaces the two-layer ManagedBlobStorage + ManagedDatasetRepository abstraction with a single, focused PersonReader ABC modeled after the existing SeedReader pattern. The core motivation (from #392) is enabling client applications to inject custom duckdb connections — for example, registering a custom fsspec filesystem for S3 access — without being blocked by the old design where DuckDBDatasetRepository always created its own internal duckdb.connect().
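For context, the injection use-case might look roughly like the following client-side sketch. The class name `S3PersonReader`, the bucket layout, and the method bodies are all hypothetical; in the real codebase the class would subclass the new `PersonReader` ABC, but it is shown duck-typed here to stay self-contained.

```python
# Hypothetical client-side reader for the S3 use-case from #392.
# In practice this would subclass the new PersonReader ABC.
class S3PersonReader:
    def __init__(self, bucket: str):
        self._bucket = bucket

    def create_duckdb_connection(self):
        # Imported lazily, mirroring the engine's lazy.duckdb pattern.
        import duckdb
        import fsspec

        conn = duckdb.connect()
        # Register a custom fsspec filesystem so DuckDB can read s3:// URIs.
        conn.register_filesystem(fsspec.filesystem("s3"))
        return conn

    def get_dataset_uri(self, locale: str) -> str:
        return f"s3://{self._bucket}/datasets/{locale}.parquet"
```

With something like this in hand, `DataDesigner(person_reader=S3PersonReader("my-bucket"))` would route all person-dataset reads through the client's own connection, which was impossible when `DuckDBDatasetRepository` created its connection internally.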

Key changes:

  • New PersonReader ABC (person_reader.py): defines create_duckdb_connection() and get_dataset_uri(locale) as abstract methods; the base class owns connection lifecycle via functools.cached_property and exposes execute() with cursor cleanup in a finally block.
  • LocalPersonReader replaces LocalBlobStorageProvider + DuckDBDatasetRepository for the default local-filesystem path, eliminating the old background-thread dataset registration and caching logic (which is no longer needed when DuckDB reads the file directly by URI).
  • ManagedDatasetGenerator now delegates URI resolution and query execution entirely to the injected PersonReader, and the mutable-default evidence={} bug is fixed to evidence=None.
  • DataDesigner gains a new person_reader: PersonReader | None constructor parameter, enabling the S3 injection use-case without subclassing.
  • SamplerColumnGenerator raises a clear DataDesignerRuntimeError when person generation is attempted but no reader is configured.
  • managed_storage.py and managed_dataset_repository.py (and their tests) are deleted entirely.
  • One test coverage gap: the new test_person_reader.py does not include a test verifying that execute() closes the cursor when an exception is raised — a scenario that was explicitly covered in the deleted test_managed_dataset_repository.py.
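Taken together, the new base class might be sketched along these lines. This is a non-authoritative reconstruction from the summary above: the method names come from the PR description, but the bodies and type hints are assumptions.

```python
from abc import ABC, abstractmethod
from functools import cached_property


class PersonReader(ABC):
    """Sketch of the new ABC; method names come from the PR summary."""

    @abstractmethod
    def create_duckdb_connection(self):
        """Return a DuckDB connection (subclasses may pre-register filesystems)."""

    @abstractmethod
    def get_dataset_uri(self, locale: str) -> str:
        """Return the URI DuckDB should read for the given locale."""

    @cached_property
    def _conn(self):
        # The base class owns the connection lifecycle: created once, lazily.
        return self.create_duckdb_connection()

    def execute(self, query: str, parameters: list):
        cursor = self._conn.cursor()
        try:
            return cursor.execute(query, parameters).df()
        finally:
            # Always close the cursor, even when execute() raises;
            # this is the cleanup path the review flags as untested.
            cursor.close()
```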

Confidence Score: 4/5

  • Safe to merge; the refactor is clean and well-tested, with one minor test coverage gap.
  • The implementation accurately follows the plan, all call sites are updated, the mutable-default-argument bug is fixed, and the new abstraction is simpler and more extensible than what it replaces. The only deduction is the missing execute() cursor-cleanup-on-exception test in test_person_reader.py.
  • packages/data-designer-engine/tests/engine/resources/test_person_reader.py — missing coverage for execute() exception path.

Important Files Changed

Filename Overview
packages/data-designer-engine/src/data_designer/engine/resources/person_reader.py New PersonReader ABC replaces ManagedBlobStorage + ManagedDatasetRepository. Defines create_duckdb_connection() and get_dataset_uri() abstract methods, a cached_property connection, and an execute() method with proper cursor cleanup. LocalPersonReader provides the default local filesystem implementation.
packages/data-designer-engine/src/data_designer/engine/resources/managed_dataset_generator.py Simplified to accept PersonReader + locale instead of ManagedDatasetRepository + dataset_name. SQL now uses FROM '{uri}' via reader.get_dataset_uri(). Mutable default evidence={} fixed to None.
packages/data-designer/src/data_designer/interface/data_designer.py Adds a `person_reader: PersonReader | None` constructor parameter so client applications can inject a custom reader without subclassing.
packages/data-designer-engine/tests/engine/resources/test_person_reader.py New tests cover URI construction, connection creation, and the factory function, but do not test the base execute() method at all — specifically missing coverage for cursor cleanup when an exception is raised during query execution.
packages/data-designer-engine/tests/engine/resources/test_managed_dataset_generator.py Tests updated to use stub_reader (mocked PersonReader) and assert SQL now uses FROM '{uri}' syntax. All previous parameterized scenarios are preserved.

Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant DD as DataDesigner
    participant RP as ResourceProvider
    participant SCG as SamplerColumnGenerator
    participant MDG as ManagedDatasetGenerator
    participant PR as PersonReader

    Client->>DD: DataDesigner(person_reader=...)
    DD->>DD: store self._person_reader
    DD->>RP: create_resource_provider(person_reader=self._person_reader or create_person_reader(...))
    RP->>SCG: resource_provider.person_reader

    Note over SCG: During generation
    SCG->>SCG: _person_generator_loader (property)
    SCG->>MDG: load_person_data_sampler(reader, locale)
    MDG-->>SCG: ManagedDatasetGenerator(reader, locale)

    SCG->>MDG: generate_samples(size, evidence)
    MDG->>PR: get_dataset_uri(locale)
    PR-->>MDG: uri (e.g. "/data/datasets/en_US.parquet")
    MDG->>MDG: build SQL: SELECT * FROM '{uri}' WHERE ... LIMIT n
    MDG->>PR: execute(query, parameters)
    PR->>PR: _conn (cached_property → create_duckdb_connection())
    PR->>PR: cursor = _conn.cursor()
    PR->>PR: cursor.execute(query, parameters).df()
    PR->>PR: cursor.close() [finally]
    PR-->>MDG: pd.DataFrame
    MDG-->>SCG: pd.DataFrame
```
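A minimal sketch of the generator side of this flow, reconstructed from the diagram. The `SELECT * FROM '{uri}' ... LIMIT n` shape comes from the PR; the WHERE-clause construction and parameter handling are assumptions.

```python
class ManagedDatasetGenerator:
    """Sketch of the simplified generator: just a reader and a locale."""

    def __init__(self, reader, locale: str):
        self._reader = reader
        self._locale = locale

    def generate_samples(self, size: int, evidence=None):
        # Mutable-default fix from the PR: the default is None, not {}.
        evidence = evidence or {}
        # DuckDB reads the parquet file directly by URI; no registration step.
        uri = self._reader.get_dataset_uri(self._locale)
        query = f"SELECT * FROM '{uri}'"
        if evidence:
            clauses = " AND ".join(f"{column} = ?" for column in evidence)
            query += f" WHERE {clauses}"
        query += f" LIMIT {size}"
        return self._reader.execute(query, list(evidence.values()))
```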
Path: packages/data-designer-engine/tests/engine/resources/test_person_reader.py
Line: 43

Comment:
**Missing test for `execute()` cursor cleanup on exception**

The deleted `test_managed_dataset_repository.py` included `test_duckdb_dataset_repository_query_cursor_cleanup`, which verified the cursor is always closed in the `finally` block even when `cursor.execute()` raises. The new `test_person_reader.py` has no equivalent test for `PersonReader.execute()`, leaving this safety property untested.

Consider adding a test such as:

```python
from unittest.mock import Mock, patch

import pytest


@patch("data_designer.engine.resources.person_reader.lazy.duckdb", autospec=True)
def test_execute_closes_cursor_on_exception(mock_duckdb: object, stub_reader: LocalPersonReader) -> None:
    mock_conn = Mock()
    mock_cursor = Mock()
    mock_duckdb.connect.return_value = mock_conn
    mock_conn.cursor.return_value = mock_cursor
    mock_cursor.execute.side_effect = Exception("query failed")

    with pytest.raises(Exception, match="query failed"):
        stub_reader.execute("SELECT 1", [])

    mock_cursor.close.assert_called_once()
```


Last reviewed commit: "Update plan to refle..."

@mikeknep changed the title from "Plan for 392 managed storage improvements" to "feat: Plan for 392 managed storage improvements" on Mar 10, 2026
@nabinchha
Contributor

Should create_duckdb_connection live on ManagedBlobStorage?

I think use_cache is a natural fit on ManagedBlobStorage — the storage implementation knows whether local caching is redundant. But create_duckdb_connection feels like it's leaking a consumer detail into the storage abstraction.

ManagedBlobStorage is documented as a "low-level interface for access objects in blob storage" — it's about blobs, not about how those blobs get queried. The reason SeedReader owns its DuckDB connection is that it is the thing that reads data via DuckDB. ManagedBlobStorage is a layer below that — it's consumed by DuckDBDatasetRepository, but it doesn't (and shouldn't need to) know that DuckDB is the query engine on top.

An alternative: pass the connection directly to DuckDBDatasetRepository (or load_managed_dataset_repository):

def load_managed_dataset_repository(
    blob_storage: ManagedBlobStorage,
    locales: list[str],
    duckdb_conn: duckdb.DuckDBPyConnection | None = None,
) -> ManagedDatasetRepository:
    return DuckDBDatasetRepository(
        blob_storage,
        config={"threads": 1, "memory_limit": "2 gb"},
        data_catalog=[Table(f"{locale}.parquet") for locale in locales],
        use_cache=blob_storage.use_cache,
        duckdb_conn=duckdb_conn,
    )

This keeps ManagedBlobStorage focused on blob access, keeps DuckDBDatasetRepository in charge of its own connection (matching the current ownership model), and still lets callers provide a custom connection with registered fsspec filesystems. The composition site — wherever load_managed_dataset_repository is called — is already the place that knows both the storage backend and the repository, so it's a natural place to coordinate the connection.

The tradeoff is that the caller has to know to pair the connection with the storage backend rather than the storage backend encapsulating that. But that seems preferable to making every ManagedBlobStorage implementation aware of DuckDB.

@mikeknep
Contributor Author

mikeknep commented Mar 11, 2026

This is all good feedback. It has me returning to an old thought: that the abstraction here is wrong.

An alternative: pass the connection directly to DuckDBDatasetRepository (or load_managed_dataset_repository)

Makes sense, but the problem is the DuckDBDatasetRepository construction / load_managed_dataset_repository function call is not accessible to client applications. The call chain for load_managed_dataset_repository is fairly convoluted:

  • it is called inside load_person_data_sampler...
  • ... which is partially loaded in SamplerColumnGenerator#_person_generator_loader...
  • ... which dances through a few other private functions in SamplerColumnGenerator, ultimately triggered by the public generate_from_scratch method (which is called by the dataset builder)

ManagedBlobStorage is documented as a "low-level interface for access objects in blob storage" — it's about blobs, not about how those blobs get queried.

Yep, but, this is the only ABC exposed to client applications that can affect how the managed datasets are queried. Specifically, clients provide an implementation of that ABC to create_resource_provider, which is sort of the de facto "dependency injection site" for the engine: it's "the thing you can configure" and pass (with a DataDesignerConfig) to a ColumnWiseDatasetBuilder.

The reason SeedReader owns its DuckDB connection is that it is the thing that reads data via DuckDB.

Sort of, but I would argue the abstraction or contract for seed datasets is: SeedDatasetColumnGenerator will always use DuckDB to read seed datasets, and it expects to receive a DuckDB connection via a SeedReader implementation (provided by the client through the resource provider).

I think person sampling with the personas datasets could (and arguably should) operate similarly: plan on always using DuckDB (I don't think we expect to ever make a separate ManagedDatasetRepository implementation that replaces DuckDBDatasetRepository), and expect to receive a DuckDB connection via "something" the client application can provide via the resource provider.

Two other related notes:

  • The ManagedBlobStorage abstraction dates back to when we had personas datasets in an S3 bucket (quite a while ago now!).
  • The DuckDBDatasetRepository has several features (caching, schemas, etc.) that aren't being used by anything today, and feels a little prematurely abstracted (we still only use it for personas datasets).

Should we take a larger hammer to this?

@nabinchha
Contributor

Should we take a larger hammer to this?

Yes, perhaps!? As you've pointed out, some of these feel like old baggage we can probably shed with a better design?

@mikeknep mikeknep force-pushed the managed-storage-refactor branch from 77f81aa to b97a9f9 Compare March 13, 2026 21:02
@mikeknep mikeknep force-pushed the managed-storage-refactor branch from b97a9f9 to 3d49808 Compare March 16, 2026 21:55
@mikeknep mikeknep changed the title feat: Plan for 392 managed storage improvements feat: Plan + Implementation for 392 managed storage improvements Mar 18, 2026
Contributor

@nabinchha left a comment

🚀

@mikeknep mikeknep force-pushed the managed-storage-refactor branch from b92aaae to 0adb537 Compare March 19, 2026 14:30
@mikeknep mikeknep merged commit 12baad7 into main Mar 19, 2026
47 checks passed
@mikeknep mikeknep deleted the managed-storage-refactor branch March 19, 2026 14:35