
feat: Plan + Implementation for 392 managed storage improvements #393

Merged
mikeknep merged 15 commits into main from managed-storage-refactor on Mar 19, 2026

Conversation

@mikeknep
Contributor

Initial plan to address #392.

@mikeknep mikeknep requested a review from a team as a code owner March 10, 2026 21:35
@greptile-apps
Contributor

greptile-apps bot commented Mar 10, 2026

Greptile Summary

This PR replaces the two-layer ManagedBlobStorage + ManagedDatasetRepository abstraction with a single, focused PersonReader ABC modeled after the existing SeedReader pattern. The core motivation (from #392) is enabling client applications to inject custom duckdb connections — for example, registering a custom fsspec filesystem for S3 access — without being blocked by the old design where DuckDBDatasetRepository always created its own internal duckdb.connect().
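For context, the injection use-case might look roughly like the following client-side sketch. The class name `S3PersonReader`, the bucket layout, and the method bodies are all hypothetical; in the real codebase the class would subclass the new `PersonReader` ABC, but it is shown duck-typed here to stay self-contained.

```python
# Hypothetical client-side reader for the S3 use-case from #392.
# In practice this would subclass the new PersonReader ABC.
class S3PersonReader:
    def __init__(self, bucket: str):
        self._bucket = bucket

    def create_duckdb_connection(self):
        # Imported lazily, mirroring the engine's lazy.duckdb pattern.
        import duckdb
        import fsspec

        conn = duckdb.connect()
        # Register a custom fsspec filesystem so DuckDB can read s3:// URIs.
        conn.register_filesystem(fsspec.filesystem("s3"))
        return conn

    def get_dataset_uri(self, locale: str) -> str:
        return f"s3://{self._bucket}/datasets/{locale}.parquet"
```

With something like this in hand, `DataDesigner(person_reader=S3PersonReader("my-bucket"))` would route all person-dataset reads through the client's own connection, which was impossible when `DuckDBDatasetRepository` created its connection internally.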

Key changes:

  • New PersonReader ABC (person_reader.py): defines create_duckdb_connection() and get_dataset_uri(locale) as abstract methods; the base class owns connection lifecycle via functools.cached_property and exposes execute() with cursor cleanup in a finally block.
  • LocalPersonReader replaces LocalBlobStorageProvider + DuckDBDatasetRepository for the default local-filesystem path, eliminating the old background-thread dataset registration and caching logic (which is no longer needed when DuckDB reads the file directly by URI).
  • ManagedDatasetGenerator now delegates URI resolution and query execution entirely to the injected PersonReader, and the mutable-default evidence={} bug is fixed to evidence=None.
  • DataDesigner gains a new person_reader: PersonReader | None constructor parameter, enabling the S3 injection use-case without subclassing.
  • SamplerColumnGenerator raises a clear DataDesignerRuntimeError when person generation is attempted but no reader is configured.
  • managed_storage.py and managed_dataset_repository.py (and their tests) are deleted entirely.
  • One test coverage gap: the new test_person_reader.py does not include a test verifying that execute() closes the cursor when an exception is raised — a scenario that was explicitly covered in the deleted test_managed_dataset_repository.py.
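Taken together, the new base class might be sketched along these lines. This is a non-authoritative reconstruction from the summary above: the method names come from the PR description, but the bodies and type hints are assumptions.

```python
from abc import ABC, abstractmethod
from functools import cached_property


class PersonReader(ABC):
    """Sketch of the new ABC; method names come from the PR summary."""

    @abstractmethod
    def create_duckdb_connection(self):
        """Return a DuckDB connection (subclasses may pre-register filesystems)."""

    @abstractmethod
    def get_dataset_uri(self, locale: str) -> str:
        """Return the URI DuckDB should read for the given locale."""

    @cached_property
    def _conn(self):
        # The base class owns the connection lifecycle: created once, lazily.
        return self.create_duckdb_connection()

    def execute(self, query: str, parameters: list):
        cursor = self._conn.cursor()
        try:
            return cursor.execute(query, parameters).df()
        finally:
            # Always close the cursor, even when execute() raises;
            # this is the cleanup path the review flags as untested.
            cursor.close()
```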

Confidence Score: 4/5

  • Safe to merge; the refactor is clean and well-tested, with one minor test coverage gap.
  • The implementation accurately follows the plan, all call sites are updated, the mutable-default-argument bug is fixed, and the new abstraction is simpler and more extensible than what it replaces. The only deduction is the missing execute() cursor-cleanup-on-exception test in test_person_reader.py.
  • packages/data-designer-engine/tests/engine/resources/test_person_reader.py — missing coverage for execute() exception path.

Important Files Changed

Filename Overview
packages/data-designer-engine/src/data_designer/engine/resources/person_reader.py New PersonReader ABC replaces ManagedBlobStorage + ManagedDatasetRepository. Defines create_duckdb_connection() and get_dataset_uri() abstract methods, a cached_property connection, and an execute() method with proper cursor cleanup. LocalPersonReader provides the default local filesystem implementation.
packages/data-designer-engine/src/data_designer/engine/resources/managed_dataset_generator.py Simplified to accept PersonReader + locale instead of ManagedDatasetRepository + dataset_name. SQL now uses FROM '{uri}' via reader.get_dataset_uri(). Mutable default evidence={} fixed to None.
packages/data-designer/src/data_designer/interface/data_designer.py Adds a `person_reader: PersonReader | None` constructor parameter so client applications can inject a custom reader without subclassing.
packages/data-designer-engine/tests/engine/resources/test_person_reader.py New tests cover URI construction, connection creation, and the factory function, but do not test the base execute() method at all — specifically missing coverage for cursor cleanup when an exception is raised during query execution.
packages/data-designer-engine/tests/engine/resources/test_managed_dataset_generator.py Tests updated to use stub_reader (mocked PersonReader) and assert SQL now uses FROM '{uri}' syntax. All previous parameterized scenarios are preserved.

Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant DD as DataDesigner
    participant RP as ResourceProvider
    participant SCG as SamplerColumnGenerator
    participant MDG as ManagedDatasetGenerator
    participant PR as PersonReader

    Client->>DD: DataDesigner(person_reader=...)
    DD->>DD: store self._person_reader
    DD->>RP: create_resource_provider(person_reader=self._person_reader or create_person_reader(...))
    RP->>SCG: resource_provider.person_reader

    Note over SCG: During generation
    SCG->>SCG: _person_generator_loader (property)
    SCG->>MDG: load_person_data_sampler(reader, locale)
    MDG-->>SCG: ManagedDatasetGenerator(reader, locale)

    SCG->>MDG: generate_samples(size, evidence)
    MDG->>PR: get_dataset_uri(locale)
    PR-->>MDG: uri (e.g. "/data/datasets/en_US.parquet")
    MDG->>MDG: build SQL: SELECT * FROM '{uri}' WHERE ... LIMIT n
    MDG->>PR: execute(query, parameters)
    PR->>PR: _conn (cached_property → create_duckdb_connection())
    PR->>PR: cursor = _conn.cursor()
    PR->>PR: cursor.execute(query, parameters).df()
    PR->>PR: cursor.close() [finally]
    PR-->>MDG: pd.DataFrame
    MDG-->>SCG: pd.DataFrame
```
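A minimal sketch of the generator side of this flow, reconstructed from the diagram. The `SELECT * FROM '{uri}' ... LIMIT n` shape comes from the PR; the WHERE-clause construction and parameter handling are assumptions.

```python
class ManagedDatasetGenerator:
    """Sketch of the simplified generator: just a reader and a locale."""

    def __init__(self, reader, locale: str):
        self._reader = reader
        self._locale = locale

    def generate_samples(self, size: int, evidence=None):
        # Mutable-default fix from the PR: the default is None, not {}.
        evidence = evidence or {}
        # DuckDB reads the parquet file directly by URI; no registration step.
        uri = self._reader.get_dataset_uri(self._locale)
        query = f"SELECT * FROM '{uri}'"
        if evidence:
            clauses = " AND ".join(f"{column} = ?" for column in evidence)
            query += f" WHERE {clauses}"
        query += f" LIMIT {size}"
        return self._reader.execute(query, list(evidence.values()))
```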
Path: packages/data-designer-engine/tests/engine/resources/test_person_reader.py
Line: 43

Comment:
**Missing test for `execute()` cursor cleanup on exception**

The deleted `test_managed_dataset_repository.py` included `test_duckdb_dataset_repository_query_cursor_cleanup`, which verified the cursor is always closed in the `finally` block even when `cursor.execute()` raises. The new `test_person_reader.py` has no equivalent test for `PersonReader.execute()`, leaving this safety property untested.

Consider adding a test such as:

```python
from unittest.mock import Mock, patch

import pytest


@patch("data_designer.engine.resources.person_reader.lazy.duckdb", autospec=True)
def test_execute_closes_cursor_on_exception(mock_duckdb: object, stub_reader: LocalPersonReader) -> None:
    mock_conn = Mock()
    mock_cursor = Mock()
    mock_duckdb.connect.return_value = mock_conn
    mock_conn.cursor.return_value = mock_cursor
    mock_cursor.execute.side_effect = Exception("query failed")

    with pytest.raises(Exception, match="query failed"):
        stub_reader.execute("SELECT 1", [])

    mock_cursor.close.assert_called_once()
```


Last reviewed commit: "Update plan to refle..."

@mikeknep changed the title from "Plan for 392 managed storage improvements" to "feat: Plan for 392 managed storage improvements" on Mar 10, 2026
@nabinchha
Contributor

Should create_duckdb_connection live on ManagedBlobStorage?

I think use_cache is a natural fit on ManagedBlobStorage — the storage implementation knows whether local caching is redundant. But create_duckdb_connection feels like it's leaking a consumer detail into the storage abstraction.

ManagedBlobStorage is documented as a "low-level interface for access objects in blob storage" — it's about blobs, not about how those blobs get queried. The reason SeedReader owns its DuckDB connection is that it is the thing that reads data via DuckDB. ManagedBlobStorage is a layer below that — it's consumed by DuckDBDatasetRepository, but it doesn't (and shouldn't need to) know that DuckDB is the query engine on top.

An alternative: pass the connection directly to DuckDBDatasetRepository (or load_managed_dataset_repository):

def load_managed_dataset_repository(
    blob_storage: ManagedBlobStorage,
    locales: list[str],
    duckdb_conn: duckdb.DuckDBPyConnection | None = None,
) -> ManagedDatasetRepository:
    return DuckDBDatasetRepository(
        blob_storage,
        config={"threads": 1, "memory_limit": "2 gb"},
        data_catalog=[Table(f"{locale}.parquet") for locale in locales],
        use_cache=blob_storage.use_cache,
        duckdb_conn=duckdb_conn,
    )

This keeps ManagedBlobStorage focused on blob access, keeps DuckDBDatasetRepository in charge of its own connection (matching the current ownership model), and still lets callers provide a custom connection with registered fsspec filesystems. The composition site — wherever load_managed_dataset_repository is called — is already the place that knows both the storage backend and the repository, so it's a natural place to coordinate the connection.

The tradeoff is that the caller has to know to pair the connection with the storage backend rather than the storage backend encapsulating that. But that seems preferable to making every ManagedBlobStorage implementation aware of DuckDB.

@mikeknep
Contributor Author

mikeknep commented Mar 11, 2026

This is all good feedback. It has me returning to an old thought: that the abstraction here is wrong.

An alternative: pass the connection directly to DuckDBDatasetRepository (or load_managed_dataset_repository)

Makes sense, but the problem is the DuckDBDatasetRepository construction / load_managed_dataset_repository function call is not accessible to client applications. The call chain for load_managed_dataset_repository is fairly convoluted:

  • it is called inside load_person_data_sampler...
  • ... which is partially loaded in SamplerColumnGenerator#_person_generator_loader...
  • ... which dances through a few other private functions in SamplerColumnGenerator, ultimately triggered by the public generate_from_scratch method (which is called by the dataset builder)

ManagedBlobStorage is documented as a "low-level interface for access objects in blob storage" — it's about blobs, not about how those blobs get queried.

Yep, but, this is the only ABC exposed to client applications that can affect how the managed datasets are queried. Specifically, clients provide an implementation of that ABC to create_resource_provider, which is sort of the de facto "dependency injection site" for the engine: it's "the thing you can configure" and pass (with a DataDesignerConfig) to a ColumnWiseDatasetBuilder.

The reason SeedReader owns its DuckDB connection is that it is the thing that reads data via DuckDB.

Sort of, but I would argue the abstraction or contract for seed datasets is: SeedDatasetColumnGenerator will always use DuckDB to read seed datasets, and it expects to receive a DuckDB connection via a SeedReader implementation (provided by the client through the resource provider).

I think person sampling with the personas datasets could (and arguably should) operate similarly: plan on always using DuckDB (I don't think we expect to ever make a separate ManagedDatasetRepository implementation that replaces DuckDBDatasetRepository), and expect to receive a DuckDB connection via "something" the client application can provide via the resource provider.

Two other related notes:

  • The ManagedBlobStorage abstraction dates back to when we had personas datasets in an S3 bucket (quite a while ago now!).
  • The DuckDBDatasetRepository has several features (caching, schemas, etc.) that aren't being used by anything today, and feels a little prematurely abstracted (we still only use it for personas datasets).

Should we take a larger hammer to this?

@nabinchha
Contributor

Should we take a larger hammer to this?

Yes, perhaps!? As you've pointed out, some of these feel like old baggage we can probably shed with a better design?

@mikeknep mikeknep force-pushed the managed-storage-refactor branch from 77f81aa to b97a9f9 Compare March 13, 2026 21:02
@mikeknep mikeknep force-pushed the managed-storage-refactor branch from b97a9f9 to 3d49808 Compare March 16, 2026 21:55
@mikeknep mikeknep changed the title feat: Plan for 392 managed storage improvements feat: Plan + Implementation for 392 managed storage improvements Mar 18, 2026
Contributor

@nabinchha left a comment

🚀

@mikeknep mikeknep force-pushed the managed-storage-refactor branch from b92aaae to 0adb537 Compare March 19, 2026 14:30
@mikeknep mikeknep merged commit 12baad7 into main Mar 19, 2026
47 checks passed
@mikeknep mikeknep deleted the managed-storage-refactor branch March 19, 2026 14:35