feat: Plan + Implementation for 392 managed storage improvements#393
Greptile Summary
This PR replaces the two-layer
Key changes:
| Filename | Overview |
|---|---|
| packages/data-designer-engine/src/data_designer/engine/resources/person_reader.py | New PersonReader ABC replaces ManagedBlobStorage + ManagedDatasetRepository. Defines create_duckdb_connection() and get_dataset_uri() abstract methods, a cached_property connection, and an execute() method with proper cursor cleanup. LocalPersonReader provides the default local filesystem implementation. |
| packages/data-designer-engine/src/data_designer/engine/resources/managed_dataset_generator.py | Simplified to accept PersonReader + locale instead of ManagedDatasetRepository + dataset_name. SQL now uses FROM '{uri}' via reader.get_dataset_uri(). Mutable default evidence={} fixed to None. |
| packages/data-designer/src/data_designer/interface/data_designer.py | Adds `person_reader: PersonReader |
| packages/data-designer-engine/tests/engine/resources/test_person_reader.py | New tests cover URI construction, connection creation, and the factory function, but do not test the base execute() method at all — specifically missing coverage for cursor cleanup when an exception is raised during query execution. |
| packages/data-designer-engine/tests/engine/resources/test_managed_dataset_generator.py | Tests updated to use stub_reader (mocked PersonReader) and assert SQL now uses FROM '{uri}' syntax. All previous parameterized scenarios are preserved. |
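
The reader contract described in the table can be pictured roughly as follows. This is a minimal sketch, not the code from the PR: the method names `create_duckdb_connection()`, `get_dataset_uri()`, and `execute()` and the cached connection come from the summary above, while the `dataset_dir` constructor argument, the one-file-per-locale `<locale>.parquet` layout, and the bare `duckdb.connect()` call are assumptions for illustration.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from functools import cached_property
from pathlib import Path
from typing import Any


class PersonReader(ABC):
    """Sketch of a reader that owns one DuckDB connection and resolves dataset URIs."""

    @abstractmethod
    def create_duckdb_connection(self) -> Any:
        """Return a new DuckDB connection configured for this storage backend."""

    @abstractmethod
    def get_dataset_uri(self, locale: str) -> str:
        """Return the URI DuckDB should read for the given locale."""

    @cached_property
    def _conn(self) -> Any:
        # Created lazily on first use, then reused for every query.
        return self.create_duckdb_connection()

    def execute(self, query: str, parameters: list[Any]) -> Any:
        cursor = self._conn.cursor()
        try:
            return cursor.execute(query, parameters).df()
        finally:
            # Runs on success and on failure, so the cursor is always closed.
            cursor.close()


class LocalPersonReader(PersonReader):
    """Assumed local layout: one `<locale>.parquet` file per locale in `dataset_dir`."""

    def __init__(self, dataset_dir: str | Path) -> None:
        self._dataset_dir = Path(dataset_dir)

    def create_duckdb_connection(self) -> Any:
        import duckdb  # imported here so the module loads without duckdb installed

        return duckdb.connect()

    def get_dataset_uri(self, locale: str) -> str:
        return str(self._dataset_dir / f"{locale}.parquet")
```

Under this shape, a client that stores the datasets elsewhere would only need to subclass `PersonReader` and override the two abstract methods; `execute()` and its cursor handling stay in the base class.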
Sequence Diagram
sequenceDiagram
    participant Client
    participant DD as DataDesigner
    participant RP as ResourceProvider
    participant SCG as SamplerColumnGenerator
    participant MDG as ManagedDatasetGenerator
    participant PR as PersonReader
    Client->>DD: DataDesigner(person_reader=...)
    DD->>DD: store self._person_reader
    DD->>RP: create_resource_provider(person_reader=self._person_reader or create_person_reader(...))
    RP->>SCG: resource_provider.person_reader
    note over SCG: During generation
    SCG->>SCG: _person_generator_loader (property)
    SCG->>MDG: load_person_data_sampler(reader, locale)
    MDG-->>SCG: ManagedDatasetGenerator(reader, locale)
    SCG->>MDG: generate_samples(size, evidence)
    MDG->>PR: get_dataset_uri(locale)
    PR-->>MDG: uri (e.g. "/data/datasets/en_US.parquet")
    MDG->>MDG: build SQL: SELECT * FROM '{uri}' WHERE ... LIMIT n
    MDG->>PR: execute(query, parameters)
    PR->>PR: _conn (cached_property → create_duckdb_connection())
    PR->>PR: cursor = _conn.cursor()
    PR->>PR: cursor.execute(query, parameters).df()
    PR->>PR: cursor.close() [finally]
    PR-->>MDG: pd.DataFrame
    MDG-->>SCG: pd.DataFrame
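
The query-building step in the diagram (`SELECT * FROM '{uri}' WHERE ... LIMIT n`) could look something like the sketch below. The helper name `build_sample_query`, the equality-only filters, and the parameter ordering are assumptions made for illustration; the PR's actual SQL may differ, for example in how rows are randomized. `evidence` defaults to `None` rather than `{}`, in line with the mutable-default fix noted in the file overview.

```python
from __future__ import annotations

from typing import Any


def build_sample_query(
    uri: str,
    size: int,
    evidence: dict[str, Any] | None = None,
) -> tuple[str, list[Any]]:
    """Return (sql, parameters) selecting up to `size` rows that match `evidence`."""
    evidence = evidence or {}  # a None default avoids sharing one dict across calls
    clauses: list[str] = []
    parameters: list[Any] = []
    for column, value in evidence.items():
        # Values are bound as parameters; column names are interpolated, so they
        # must come from trusted configuration, not from user input.
        clauses.append(f"{column} = ?")
        parameters.append(value)
    where = f" WHERE {' AND '.join(clauses)}" if clauses else ""
    return f"SELECT * FROM '{uri}'{where} LIMIT {int(size)}", parameters
```

The returned pair would then be handed to the reader, along the lines of `reader.execute(query, parameters)`, with `uri` obtained from `reader.get_dataset_uri(locale)`.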
This is a comment left during a code review.
Path: packages/data-designer-engine/tests/engine/resources/test_person_reader.py
Line: 43
Comment:
**Missing test for `execute()` cursor cleanup on exception**
The deleted `test_managed_dataset_repository.py` included `test_duckdb_dataset_repository_query_cursor_cleanup`, which verified the cursor is always closed in the `finally` block even when `cursor.execute()` raises. The new `test_person_reader.py` has no equivalent test for `PersonReader.execute()`, leaving this safety property untested.
Consider adding a test such as:
```python
from unittest.mock import Mock, patch

import pytest

from data_designer.engine.resources.person_reader import LocalPersonReader


@patch("data_designer.engine.resources.person_reader.lazy.duckdb", autospec=True)
def test_execute_closes_cursor_on_exception(mock_duckdb: object, stub_reader: LocalPersonReader) -> None:
    mock_conn = Mock()
    mock_cursor = Mock()
    mock_duckdb.connect.return_value = mock_conn
    mock_conn.cursor.return_value = mock_cursor
    mock_cursor.execute.side_effect = Exception("query failed")
    # The cursor must be closed even though execute() raises.
    with pytest.raises(Exception, match="query failed"):
        stub_reader.execute("SELECT 1", [])
    mock_cursor.close.assert_called_once()
```
How can I resolve this? If you propose a fix, please make it concise.
Last reviewed commit: "Update plan to refle..."
Should I think
An alternative: pass the connection directly to

    def load_managed_dataset_repository(
        blob_storage: ManagedBlobStorage,
        locales: list[str],
        duckdb_conn: duckdb.DuckDBPyConnection | None = None,
    ) -> ManagedDatasetRepository:
        return DuckDBDatasetRepository(
            blob_storage,
            config={"threads": 1, "memory_limit": "2 gb"},
            data_catalog=[Table(f"{locale}.parquet") for locale in locales],
            use_cache=blob_storage.use_cache,
            duckdb_conn=duckdb_conn,
        )

This keeps

The tradeoff is that the caller has to know to pair the connection with the storage backend, rather than the storage backend encapsulating that. But that seems preferable to making every
This is all good feedback. It has me returning to an old thought: that the abstraction here is wrong.
Makes sense, but the problem is the
Yep, but, this is the only ABC exposed to client applications that can affect how the managed datasets are queried. Specifically, clients provide an implementation of that ABC to
Sort of, but I would argue the abstraction or contract for seed datasets is:
I think person sampling with the personas datasets could/should operate similarly: plan on always using DuckDB (I don't think we expect to ever make a separate
Two other related notes:
Should we take a larger hammer to this?
Yes, perhaps!? As you've pointed out, some of these feel like old baggage we can probably shed with a better design?
77f81aa to b97a9f9
b97a9f9 to 3d49808
b92aaae to 0adb537
Initial plan to address #392.