feat: wire async task-queue scheduler into ColumnWiseDatasetBuilder #429
andreatgretel wants to merge 13 commits into main
Conversation
Greptile Summary

This PR completes the async engine integration series by wiring the async task-queue scheduler into ColumnWiseDatasetBuilder. Key changes:
| Filename | Overview |
|---|---|
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py | Core async scheduler updated with error-rate shutdown, per-RG in-flight counters, pre-batch/post-batch callbacks, disable_early_shutdown, and _admitted_rg_ids pruning on checkpoint. The early-shutdown checkpoint fix (lines 141-143), is_column_complete_for_rg public method usage, and _admitted_rg_ids.discard are all correctly wired. The cumulative (non-sliding) error-rate window is intentional per PR description. |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/column_wise_builder.py | Main integration point. _build_async correctly builds the generator map (including MultiColumnConfig expansion), partitions rows into groups, and wires all scheduler callbacks. on_before_checkpoint is correctly gated. However, on_seeds_complete is always passed unconditionally, which sets has_pre_batch=True in the scheduler for all pipelines — unnecessarily serializing dispatch of non-seed tasks until all seeds across the row group complete, even when no PRE_BATCH processors exist. |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/completion_tracker.py | Added is_column_complete_for_rg public method that correctly checks _batch_complete first (for FULL_COLUMN), then does a per-row validation for CELL_BY_CELL columns. Drop-row propagation to _reevaluate_batch_tasks is correct. No issues found. |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/row_group_buffer.py | replace_dataframe correctly marks trailing buffer slots as dropped (line 85) when a processor returns fewer rows. get_dataframe filters dropped rows. checkpoint_row_group frees memory after write. No issues found. |
| packages/data-designer-engine/tests/engine/dataset_builders/test_async_builder_integration.py | Good integration test coverage: multi-column pipeline with seed/cell/full-column generators, checkpoint parquet calls, allow_resize validation guard, and graph topology ordering. End-to-end async flow is exercised with @pytest.mark.asyncio. |
Sequence Diagram
```mermaid
sequenceDiagram
    participant B as ColumnWiseDatasetBuilder
    participant L as BackgroundEventLoop
    participant S as AsyncTaskScheduler
    participant A as _admit_row_groups
    participant T as CompletionTracker
    participant M as RowGroupBufferManager
    B->>B: _validate_async_compatibility()
    B->>B: Build gen_map, ExecutionGraph
    B->>T: CompletionTracker.with_graph(graph, row_groups)
    B->>M: RowGroupBufferManager(artifact_storage)
    B->>S: AsyncTaskScheduler(generators, graph, tracker, row_groups, ...)
    B->>L: _ensure_async_engine_loop()
    B->>L: run_coroutine_threadsafe(scheduler.run())
    activate S
    S->>A: create_task(_admit_row_groups)
    activate A
    loop For each row_group
        A->>S: acquire rg_semaphore
        A->>M: init_row_group(rg_id, rg_size)
        A->>S: _dispatch_seeds(rg_id) → create_task(_execute_seed_task)
        A->>S: wake_event.set()
    end
    deactivate A
    loop Main dispatch loop
        S->>S: _run_seeds_complete_check(seed_cols)
        note over S: Fires on_seeds_complete when all seeds done & no in-flight
        S->>B: on_seeds_complete(rg_id, rg_size) [if PRE_BATCH processors]
        B->>M: get_dataframe(rg_id) → run_pre_batch → replace_dataframe
        B->>T: drop_row for newly-dropped rows
        S->>T: get_ready_tasks(dispatched, admitted_rgs)
        S->>S: create_task(_execute_task) for each ready task
        S->>S: _checkpoint_completed_row_groups(all_columns)
        note over S: Calls on_before_checkpoint if POST_BATCH processors present
        S->>B: on_before_checkpoint(rg_id, rg_size)
        B->>M: get_dataframe → run_post_batch → replace_dataframe
        S->>M: checkpoint_row_group(rg_id) → write parquet
        S->>S: rg_semaphore.release()
        S->>B: on_row_group_complete(rg_id) → on_batch_complete(path)
    end
    opt Salvage rounds (retryable failures)
        S->>S: Re-dispatch deferred tasks
        S->>S: _drain_frontier(seed_cols, has_pre_batch, all_columns)
        S->>S: _checkpoint_completed_row_groups(all_columns)
    end
    S-->>L: scheduler.run() returns
    L-->>B: future.result() returns
    B->>S: collect traces
    B->>M: write_metadata(target_num_records, buffer_size)
    deactivate S
```
Prompt To Fix All With AI
This is a comment left during a code review.
Path: packages/data-designer-engine/src/data_designer/engine/dataset_builders/column_wise_builder.py
Line: 294
Comment:
**`on_seeds_complete` always passed, unnecessarily enabling `has_pre_batch` gate**
`on_seeds_complete` is always passed to `AsyncTaskScheduler` regardless of whether any `PRE_BATCH` processors are configured. Inside the scheduler, `has_pre_batch = self._on_seeds_complete is not None` evaluates to `True`, causing ALL non-seed tasks to be gated behind `_pre_batch_done_rgs` for every row group — even when no pre-batch processing is required.
The consequence is non-trivial for pipelines with multiple parallel seed columns: with `has_pre_batch = True`, both `seed_A` and `seed_B` must fully complete before either `downstream_X` or `downstream_Y` can begin dispatch. Without the gate, `downstream_X` would start as soon as `seed_A` finishes (if `seed_B` is not an upstream dependency of `downstream_X`). This prevents the async scheduler from pipelining work that could otherwise run in parallel.
`on_before_checkpoint` was fixed with the identical guard in this same PR (line 295-297). Apply the same treatment to `on_seeds_complete`:
```suggestion
on_seeds_complete=(on_seeds_complete if self._processor_runner.has_processors_for(ProcessorStage.PRE_BATCH) else None),
```
How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: "fix: skip on_before_..."
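The gating behavior this comment describes can be sketched in miniature. Everything below (`ProcessorStage`, `ProcessorRunner`, `Scheduler`, `build_scheduler`) is a simplified stand-in for the PR's types, not the real API:

```python
# Sketch of the conditional-callback wiring the review suggests: the
# scheduler derives its pre-batch gate from callback *presence*, so the
# builder should only pass the callback when a PRE_BATCH processor exists.
from enum import Enum, auto
from typing import Callable, Optional


class ProcessorStage(Enum):
    PRE_BATCH = auto()
    POST_BATCH = auto()


class ProcessorRunner:
    def __init__(self, stages: set[ProcessorStage]):
        self._stages = stages

    def has_processors_for(self, stage: ProcessorStage) -> bool:
        return stage in self._stages


class Scheduler:
    def __init__(self, on_seeds_complete: Optional[Callable[[int, int], None]] = None):
        # Passing the callback unconditionally would force this gate to True
        # for every pipeline, serializing non-seed dispatch behind all seeds.
        self.has_pre_batch = on_seeds_complete is not None


def build_scheduler(
    runner: ProcessorRunner, on_seeds_complete: Callable[[int, int], None]
) -> Scheduler:
    # Only hand the callback over when a PRE_BATCH processor actually exists.
    return Scheduler(
        on_seeds_complete=(
            on_seeds_complete
            if runner.has_processors_for(ProcessorStage.PRE_BATCH)
            else None
        )
    )
```

With this wiring, pipelines without PRE_BATCH processors keep `has_pre_batch` False, so downstream columns can dispatch as soon as their own upstream seeds finish.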
(AR) Warning: on_seeds_complete not evaluated in salvage rounds
- What: The seeds-complete check (lines 142-156) only runs inside the main `while True` loop. Salvage rounds retry deferred seed tasks but never check whether seeds completed, so the pre-batch processor is silently skipped for row groups whose seeds succeed in salvage.
- Why: If all seed tasks for a row group are rate-limited on first attempt and succeed in salvage, downstream tasks run and the row group checkpoints without the pre-batch processor having run.
- Suggestion:
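A minimal synchronous sketch of the fix this warning points toward: extract the seeds-complete check into a helper and call it from salvage rounds as well. `MiniScheduler` and its fields are illustrative stand-ins, not the scheduler's real structure:

```python
# Toy model: the seeds-complete check lives in one helper, invoked from both
# the main loop and salvage rounds, so row groups whose seeds only succeed
# after retry still get their pre-batch callback.
class MiniScheduler:
    def __init__(self, seed_cols, completed, pre_batch_done, on_seeds_complete):
        self.seed_cols = seed_cols            # seed column names
        self.completed = completed            # {(rg_id, column), ...}
        self.pre_batch_done = pre_batch_done  # rg_ids whose callback fired
        self.on_seeds_complete = on_seeds_complete

    def _run_seeds_complete_check(self, rg_ids):
        # Shared helper: fire the pre-batch callback once per row group
        # whose seed columns are all complete.
        for rg_id in rg_ids:
            if rg_id in self.pre_batch_done:
                continue
            if all((rg_id, col) in self.completed for col in self.seed_cols):
                self.on_seeds_complete(rg_id)
                self.pre_batch_done.add(rg_id)

    def salvage_round(self, rg_ids, retried_tasks):
        # Retried seed tasks may only now complete their row group, so the
        # same check must run here too; otherwise pre-batch is skipped.
        self.completed.update(retried_tasks)
        self._run_seeds_complete_check(rg_ids)
```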
(AR) This PR wires the AsyncTaskScheduler into ColumnWiseDatasetBuilder behind DATA_DESIGNER_ASYNC_ENGINE=1, adding _build_async(), pre/post-batch processor callbacks, error rate shutdown, and task trace surfacing. The sync path is unchanged. All 8 changed files pass ruff clean and 22 tests pass (6 integration + 16 scheduler). build_preview() async path (mentioned in the plan) was not implemented in this PR.

The most significant finding is a critical data-loss path in _checkpoint_completed_row_groups: if on_before_checkpoint raises, the row group's buffer is never freed and its data is silently dropped. Two warning-level findings identify scenarios where non-seed tasks can bypass the pre-batch processor barrier (multi-seed pipelines with partial dependencies, and salvage rounds skipping the seeds-complete check), both producing silently incorrect output.

The async/sync code path separation is clean and the error rate shutdown mechanism is well-designed. Verdict: needs-changes — 1 critical, 7 warnings, 4 suggestions.
- Wire on_batch_complete through on_row_group_complete callback
- Mark trailing slots as dropped in replace_dataframe when processor filters rows
- Ensure checkpoint still runs when on_before_checkpoint raises
- Gate non-seed task dispatch on pre-batch completion
- Add public run_pre_batch_on_df to ProcessorRunner (replaces private _run_stage call)
- Add public is_column_complete_for_rg to CompletionTracker (replaces private _completed access)
- Type task_traces as list[TaskTrace] in results.py
- Add async_trace docstring to RunConfig
- Move module-level log into _build_async
- Add replace_dataframe unit tests (same-size, dropped rows, fewer rows)
- Assert on public outcomes in scheduler tests instead of private attributes
- Parametrize allow_resize validation tests
- Cache seed_cols before main loop
- Remove redundant disable_early_shutdown from AsyncTaskScheduler
- Flush completed row groups before breaking on early shutdown (data loss)
- Change error rate check from >= to > so disable_early_shutdown sentinel (1.0) never triggers at 100% failure rate
- Extract seeds-complete check into helper and call it in salvage rounds via _drain_frontier, with pre-batch gating, so pre-batch processor runs even when seed tasks succeed only after retry
- Fix is_column_complete_for_rg to check _batch_complete first, then verify all non-dropped rows for CELL_BY_CELL columns
- Replace O(|in-flight|) scan in _in_flight_for_rg with per-RG counter
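The `>=` to `>` change can be illustrated with a small stand-alone check; `should_shut_down` and `DISABLE_SENTINEL` are hypothetical names, not the scheduler's actual API:

```python
# With a strict ">", a disable_early_shutdown sentinel of 1.0 can never
# trigger: even a 100% failure rate only *equals* the threshold.
DISABLE_SENTINEL = 1.0


def should_shut_down(failed: int, total: int, shutdown_error_rate: float) -> bool:
    if total == 0:
        return False
    # Strict comparison: rate == 1.0 never exceeds the 1.0 sentinel, so
    # passing the sentinel effectively turns early shutdown off.
    return (failed / total) > shutdown_error_rate
```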
Addressed all remaining open issues in 3facfef:
All 63 tests in
… safely

Pre-batch processors that filter rows marked them as dropped in RowGroupBufferManager but not in CompletionTracker, causing unnecessary LLM calls for rows that would be discarded at checkpoint time. Also wrap the benchmark warmup stderr redirect in try/finally so stderr is restored if _run_once raises.
Prevents unbounded growth of the admission set across large runs.
Dev-time benchmarks and manual test scripts - kept locally, not needed in the PR.
@nabinchha Re: renaming
RunConfig.disable_early_shutdown was forwarded to the sync executor but silently ignored in the async path. Now passed through to the scheduler's _check_error_rate.
Verifies the async scheduler dispatches independent LLM columns concurrently by checking for overlapping task trace intervals. Uses a wide DAG (sampler -> 2 parallel LLM columns) with 2 records. Requires NVIDIA_API_KEY.
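The overlap check behind this e2e test can be sketched as follows; `intervals_overlap` and `any_concurrent` are illustrative helpers, not the test's actual code:

```python
# Two tasks ran concurrently iff their (start, end) trace intervals overlap;
# a wide DAG with parallel LLM columns should produce at least one overlap.


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    # The later of the two starts must precede the earlier of the two ends.
    return max(a[0], b[0]) < min(a[1], b[1])


def any_concurrent(traces: list[tuple[float, float]]) -> bool:
    # Pairwise check across all task trace intervals.
    return any(
        intervals_overlap(traces[i], traces[j])
        for i in range(len(traces))
        for j in range(i + 1, len(traces))
    )
```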
…g unprocessed data

Matches on_seeds_complete failure behavior and avoids silently checkpointing unfiltered rows when a post-batch processor fails.
Response to Greptile summary (37e3b62):
Avoids unnecessary DataFrame round-trip for every row group in the common case where no post-batch processors exist.
| max_conversation_correction_steps: Maximum number of correction rounds permitted within a
| single conversation when generation tasks call `ModelFacade.generate(...)`. Must be >= 0.
| Default is 0.
| async_trace: If True, collect per-task tracing data when using the async engine
per-task here may need a little more context?
| return False
| return True
|
| def is_column_complete_for_rg(self, column: str, row_group: int) -> bool:
super nit: row_group -> row_group_index
nabinchha left a comment:
Nice work on this integration, Andre — the callback wiring and the iterative response to prior feedback are both really well done. Here are my findings from a full read-through via claude.
Findings
Critical — Must fix before merge
None — the prior review's critical findings (silent on_batch_complete drop, replace_dataframe stale data, private attribute access, pre-batch barrier bypass, _run_stage cross-boundary call, bare list typing) have all been addressed in subsequent commits. Great job working through those.
Warnings — Strongly recommend fixing
async_scheduler.py:318 — Redundant seed_cols recomputation in _dispatch_seeds
- What: `_dispatch_seeds` recomputes `seed_cols` from the graph's topological order on every call, but `run()` already computes `seed_cols` as a `frozenset` at the top of the method and passes it through `_run_seeds_complete_check` and `_drain_frontier`.
- Why: Minor inefficiency (re-traverses the graph per row group admission), but more importantly a consistency risk — the two computations use different logic (`get_topological_order` + filter vs. `columns` + filter). If the graph ever distinguishes between these, they could diverge silently.
- Suggestion: Pass the already-computed `seed_cols` frozenset into `_dispatch_seeds` instead of recomputing it — should be a quick fix.
async_scheduler.py:284-299 — _run_seeds_complete_check iterates _active_rgs while callbacks may mutate it indirectly
- What: The method iterates over `self._active_rgs` (a list of tuples). The `on_seeds_complete` callback can raise, causing `drop_row` calls. While `drop_row` doesn't directly modify `_active_rgs`, if a future change to the drop logic triggers a checkpoint (which does remove from `_active_rgs`), this would cause a mutation-during-iteration bug.
- Why: Fragile iteration pattern. The current code is safe today, but the coupling is subtle.
- Suggestion: Iterate over a snapshot: `for rg_id, rg_size in list(self._active_rgs):` — cheap insurance.
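A quick demonstration of why the snapshot matters, using plain lists in place of `_active_rgs`:

```python
# Removing from a list while iterating it skips elements; iterating a
# list(...) snapshot processes every element even as the original shrinks.


def process_unsafe(active_rgs):
    seen = []
    for rg in active_rgs:        # mutating the list we are iterating
        seen.append(rg)
        active_rgs.remove(rg)    # shifts remaining items, skipping one
    return seen


def process_safe(active_rgs):
    seen = []
    for rg in list(active_rgs):  # snapshot: safe even if entries are removed
        seen.append(rg)
        active_rgs.remove(rg)
    return seen
```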
column_wise_builder.py:314-319 — Telemetry exception silently swallowed with bare except Exception: pass
- What: The telemetry emission block catches all exceptions and discards them silently. This is a pre-existing pattern from the sync path, but it's now replicated in the new async path.
- Why: If telemetry breaks (e.g., wrong snapshot format after refactoring), there's zero signal — no log, no metric. A `logger.debug` would cost nothing and save debugging time.
- Suggestion: Add minimal logging: `except Exception: logger.debug("Failed to emit batch telemetry for async run", exc_info=True)`
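A runnable sketch of the suggested pattern; `emit_telemetry` is a hypothetical stand-in that always fails, to show the debug record being emitted:

```python
# Swallow the telemetry failure (it must stay non-fatal) but leave a
# debug-level trace with the traceback attached via exc_info.
import logging

logger = logging.getLogger("data_designer.sketch")


def emit_telemetry(snapshot):
    # Hypothetical failing emitter, standing in for the real telemetry call.
    raise ValueError("wrong snapshot format")


def emit_batch_telemetry_safely(snapshot):
    try:
        emit_telemetry(snapshot)
    except Exception:
        # Still non-fatal, but now there is a signal when telemetry breaks.
        logger.debug("Failed to emit batch telemetry for async run", exc_info=True)
```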
async_scheduler.py:260-274 — on_before_checkpoint failure drops all rows including already-generated data
- What: When `on_before_checkpoint` raises, the handler drops every row in the row group. The `on_before_checkpoint` callback runs after all column generation is complete, so this discards fully-generated data. This matches the `on_seeds_complete` failure pattern (which makes sense there since no downstream work has happened yet), but here the trade-off is different.
- Why: This is a reasonable design choice — if the post-batch transform is considered mandatory, dropping is correct. Just want to flag it as worth a brief comment explaining the rationale, since it's not immediately obvious why we wouldn't fall back to checkpointing the pre-transform data.
- Suggestion: A one-line comment above the drop loop would be enough, e.g. `# Post-batch is mandatory; drop rather than checkpoint unprocessed data.`
column_wise_builder.py:277-280 — on_before_checkpoint passes rg_id as current_batch_number to run_post_batch
- What: The closure passes `current_batch_number=rg_id`. In the sync path, `current_batch_number` is a sequential 0-based batch index. In the async path, `rg_id` is also 0-based and sequential, so this works today.
- Why: If row-group IDs ever become non-sequential (e.g., retried row groups get new IDs), this would silently pass incorrect batch numbers to processors that depend on them.
- Suggestion: Low urgency — either document that `rg_id` must remain sequential, or introduce a separate batch counter for the async path.
Suggestions — Consider improving
test_async_engine.py:26-27 — Mutating module-level global for test isolation
- What: The e2e test directly mutates `cwb.DATA_DESIGNER_ASYNC_ENGINE = True` and restores it in a `finally` block.
- Why: Fragile if tests run concurrently in the same process, and won't restore on `KeyboardInterrupt`.
- Suggestion: `unittest.mock.patch.object(cwb, "DATA_DESIGNER_ASYNC_ENGINE", True)` as a context manager would be safer.
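A sketch of the `patch.object` approach, with a `SimpleNamespace` standing in for the `cwb` module:

```python
# patch.object restores the original attribute value on normal exit and on
# exceptions, with no hand-written finally block to forget.
import types
from unittest import mock

# Stand-in for the cwb module and its feature flag.
cwb = types.SimpleNamespace(DATA_DESIGNER_ASYNC_ENGINE=False)


def run_with_async_engine(body):
    with mock.patch.object(cwb, "DATA_DESIGNER_ASYNC_ENGINE", True):
        assert cwb.DATA_DESIGNER_ASYNC_ENGINE is True
        body()
    # On exiting the context manager, the flag is back to its original value.
```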
async_scheduler.py:91-92 — _in_flight_counts can theoretically go negative
- What: The counter is decremented in the `finally` block of `_execute_task_inner`. If a task were somehow processed twice, the counter could go below zero.
- Why: A negative count would cause `_in_flight_for_rg` to return `False` prematurely, potentially triggering `_run_seeds_complete_check` too early.
- Suggestion: Add a floor: `max(0, self._in_flight_counts.get(task.row_group, 0) - 1)`. Purely defensive.
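The defensive floor can be sketched as a tiny counter class; `InFlightCounts` is illustrative, not the scheduler's real structure:

```python
# Per-row-group in-flight counter with a zero floor: a duplicate decrement
# can no longer push the count negative and make the row group look idle
# at the wrong time.
class InFlightCounts:
    def __init__(self):
        self._counts: dict[int, int] = {}

    def increment(self, rg_id: int) -> None:
        self._counts[rg_id] = self._counts.get(rg_id, 0) + 1

    def decrement(self, rg_id: int) -> None:
        # Floor at zero so a double decrement stays at 0 rather than -1.
        self._counts[rg_id] = max(0, self._counts.get(rg_id, 0) - 1)

    def in_flight_for_rg(self, rg_id: int) -> bool:
        return self._counts.get(rg_id, 0) > 0
```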
column_wise_builder.py:68 — Importing private function _ensure_async_engine_loop
- What: The builder imports `_ensure_async_engine_loop` (underscore-prefixed) from `async_concurrency`.
- Why: Crosses a module boundary using a private function.
- Suggestion: Consider renaming to `ensure_async_engine_loop` (public) or re-exporting from the module's `__init__`.
test_async_builder_integration.py:108 — Import inside async test function
- What: `test_build_async_end_to_end` and `test_checkpoint_produces_correct_parquet_calls` import `AsyncTaskScheduler` inside the test body.
- Why: AGENTS.md prefers module-level imports.
- Suggestion: Move to module level or use `pytest.importorskip`. Minor nit.
What Looks Good
- Thorough callback wiring: The pre-batch and post-batch processor callbacks are cleanly integrated with proper error handling and row-drop synchronization between `RowGroupBufferManager` and `CompletionTracker`. The `on_seeds_complete`/`on_before_checkpoint` separation is a clean design.
- Responsive to review feedback: All prior review comments have been addressed in follow-up commits — `on_batch_complete` wired through, `replace_dataframe` drop handling, public `is_column_complete_for_rg` and `run_pre_batch_on_df` methods, pre-batch barrier gating, `_admitted_rg_ids` pruning, `disable_early_shutdown` forwarding. The commit history tells a clear story of iterative improvement.
- Well-structured tests: The new test files cover the key integration scenarios (multi-column pipeline, checkpoint correctness, `allow_resize` guard, pre-batch failure, error rate shutdown) with clean setup and meaningful assertions on observable outcomes. The e2e concurrency test using trace interval overlap is a clever verification approach.
Verdict
Ship it (with nits) — The core integration is solid, all prior critical feedback has been addressed, the sync path is untouched, and test coverage is good. The warnings above are mostly about defensive coding and documentation — none are blocking. Nice work getting this across the finish line!
Summary
Final integration PR in the async engine series (Plan #346). Wires the `AsyncTaskScheduler` and supporting components (built in PRs #356 and #404) into `ColumnWiseDatasetBuilder`. The async path is gated behind `DATA_DESIGNER_ASYNC_ENGINE=1`; the sync path is unchanged.

Closes #346
Changes
Added
- `column_wise_builder.py` - `_build_async()`: builds `ExecutionGraph`, partitions rows into groups, creates `CompletionTracker` + `RowGroupBufferManager`, runs `AsyncTaskScheduler` on the background loop
- `_validate_async_compatibility()`: raises `DatasetGenerationError` at startup if any column uses `allow_resize=True` with the async engine
- Pre/post-batch processor callbacks (`on_seeds_complete`, `on_before_checkpoint`)
- `on_batch_complete` callback wired through `on_row_group_complete` for progress notifications
- `task_traces` property on `ColumnWiseDatasetBuilder` and `DataDesigner` - populated when `run_config.async_trace=True`
- `run_config.async_trace: bool` field for enabling per-task tracing
- `test_async_builder_integration.py`: integration tests for multi-column pipelines, checkpoint correctness, `allow_resize` guard, pre-batch failure, error rate shutdown

Changed
- `async_scheduler.py`: error rate shutdown (`shutdown_error_rate`, `shutdown_error_window`, `disable_early_shutdown`), pre-batch seed-completion tracking, `_admitted_rg_ids` pruning on checkpoint, `_in_flight_for_rg` helper
- `test_async_scheduler.py`: expanded coverage for error rate shutdown and new callbacks

Fixed
- Drop-row propagation from `RowGroupBufferManager` to `CompletionTracker`, preventing unnecessary LLM calls on filtered rows (c650a2f)
- `_admitted_rg_ids` pruned on checkpoint to prevent unbounded growth (71e7412)
- `disable_early_shutdown` wired into `AsyncTaskScheduler` - was silently ignored in the async path (259828d)

Attention Areas
- `column_wise_builder.py` - `_build_async` - main wiring; verify generator map construction for `MultiColumnConfig` and row group partitioning matches sync path semantics
- `async_scheduler.py` - error rate shutdown - window-based rate uses cumulative counts, not a sliding window (intentional for v1 simplicity)

Follow-up
- Rename `ColumnWiseDatasetBuilder` to `DatasetBuilder`