
feat: wire async task-queue scheduler into ColumnWiseDatasetBuilder#429

Open
andreatgretel wants to merge 13 commits into main from andreatgretel/feat/async-builder-integration

Conversation


@andreatgretel andreatgretel commented Mar 17, 2026

Summary

Final integration PR in the async engine series (Plan #346). Wires the AsyncTaskScheduler and supporting components (built in PRs #356 and #404) into ColumnWiseDatasetBuilder. The async path is gated behind DATA_DESIGNER_ASYNC_ENGINE=1 - the sync path is unchanged.

Closes #346
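The opt-in gate described above can be sketched as follows; this is an illustrative dispatch pattern, not the builder's actual code (the build_sync/build_async method names are stand-ins):

```python
import os

# Hypothetical sketch of the env-var gate; only DATA_DESIGNER_ASYNC_ENGINE
# itself comes from the PR, everything else is illustrative.
DATA_DESIGNER_ASYNC_ENGINE = os.environ.get("DATA_DESIGNER_ASYNC_ENGINE") == "1"

def build(builder, use_async=DATA_DESIGNER_ASYNC_ENGINE):
    # Sync path stays the default; the async path is strictly opt-in.
    if use_async:
        return builder.build_async()
    return builder.build_sync()
```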

Changes

Added

  • column_wise_builder.py - _build_async(): builds ExecutionGraph, partitions rows into groups, creates CompletionTracker + RowGroupBufferManager, runs AsyncTaskScheduler on the background loop
  • _validate_async_compatibility(): raises DatasetGenerationError at startup if any column uses allow_resize=True with async engine
  • Pre/post-batch processor callbacks wired into scheduler (on_seeds_complete, on_before_checkpoint)
  • on_batch_complete callback wired through on_row_group_complete for progress notifications
  • task_traces property on ColumnWiseDatasetBuilder and DataDesigner - populated when run_config.async_trace=True
  • run_config.async_trace: bool field for enabling per-task tracing
  • test_async_builder_integration.py: integration tests for multi-column pipelines, checkpoint correctness, allow_resize guard, pre-batch failure, error rate shutdown
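A minimal sketch of the allow_resize guard described above, assuming a simplified dict-based column shape (the standalone function and error class are stand-ins for the real config types):

```python
# Illustrative version of _validate_async_compatibility; the real method
# lives on ColumnWiseDatasetBuilder and operates on real column configs.
class DatasetGenerationError(Exception):
    pass

def validate_async_compatibility(columns):
    # The async engine writes cells into fixed-size row-group buffers, so a
    # column that may resize the dataset cannot be supported; fail at startup.
    offending = [c["name"] for c in columns if c.get("allow_resize")]
    if offending:
        raise DatasetGenerationError(
            f"allow_resize=True is incompatible with the async engine: {offending}"
        )
```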

Changed

  • async_scheduler.py: error rate shutdown (shutdown_error_rate, shutdown_error_window, disable_early_shutdown), pre-batch seed-completion tracking, _admitted_rg_ids pruning on checkpoint, _in_flight_for_rg helper
  • test_async_scheduler.py: expanded coverage for error rate shutdown and new callbacks
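The error-rate shutdown mechanism can be illustrated with a minimal check. The strict greater-than comparison (rather than >=) is what lets a sentinel rate of 1.0 act as "disabled" even at a 100% failure rate, as a later commit in this PR notes; the field names mirror the PR text, but the function itself is a sketch:

```python
# Sketch of an error-rate early-shutdown check; not the scheduler's actual
# _check_error_rate, just the comparison logic the PR discusses.
def should_shut_down(failures, attempts, shutdown_error_rate, shutdown_error_window):
    if attempts < shutdown_error_window:
        return False  # not enough samples yet to judge the error rate
    # Strict '>' means shutdown_error_rate=1.0 can never trigger, even at
    # a 100% failure rate, so 1.0 doubles as a disable sentinel.
    return failures / attempts > shutdown_error_rate
```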

Fixed

  • Pre-batch row drops now synced from RowGroupBufferManager to CompletionTracker, preventing unnecessary LLM calls on filtered rows (c650a2f)
  • _admitted_rg_ids pruned on checkpoint to prevent unbounded growth (71e7412)
  • disable_early_shutdown wired into AsyncTaskScheduler - was silently ignored in the async path (259828d)
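A rough sketch of the first fix above, with illustrative buffer-manager and tracker interfaces (dropped_rows is a hypothetical accessor, not the real RowGroupBufferManager API):

```python
# Illustrative sync of pre-batch row drops from the buffer manager to the
# completion tracker, so no generation tasks are dispatched for filtered rows.
def sync_dropped_rows(buffer_mgr, tracker, rg_id):
    for row_idx in buffer_mgr.dropped_rows(rg_id):
        tracker.drop_row(rg_id, row_idx)
```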


@andreatgretel andreatgretel requested a review from a team as a code owner March 17, 2026 21:20

greptile-apps bot commented Mar 17, 2026

Greptile Summary

This PR completes the async engine integration series by wiring AsyncTaskScheduler into ColumnWiseDatasetBuilder behind the DATA_DESIGNER_ASYNC_ENGINE=1 environment gate. The sync path is fully unchanged. The integration is well-structured, with all callbacks (on_seeds_complete, on_before_checkpoint, on_row_group_complete) correctly connected and fixes from previous review iterations (early-shutdown checkpoint drain, _admitted_rg_ids pruning, pre-batch row drop sync, disable_early_shutdown wiring, is_column_complete_for_rg correctness) all in place.

Key changes:

  • _build_async() builds the ExecutionGraph, partitions rows into groups, and runs the scheduler on the background event loop via asyncio.run_coroutine_threadsafe
  • _validate_async_compatibility() eagerly rejects allow_resize=True columns, which are incompatible with cell-level buffer writes
  • on_before_checkpoint is correctly passed conditionally (only when POST_BATCH processors exist), but on_seeds_complete is passed unconditionally — this sets has_pre_batch=True in the scheduler for every pipeline, regardless of whether PRE_BATCH processors are configured, needlessly serializing dispatch of downstream tasks until all seeds across the row group complete
  • task_traces property added to both ColumnWiseDatasetBuilder and DatasetCreationResults, gated behind run_config.async_trace
  • CompletionTracker.is_column_complete_for_rg now correctly handles CELL_BY_CELL columns by checking _batch_complete first then doing a per-row validation
  • Integration and E2E tests added with appropriate coverage for multi-column pipelines, checkpoint correctness, and concurrency verification
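The background-loop pattern mentioned in the first bullet can be sketched with plain asyncio; the loop setup below is illustrative, not the project's _ensure_async_engine_loop, and scheduler_run stands in for AsyncTaskScheduler.run():

```python
import asyncio
import threading

def start_background_loop():
    # Run a dedicated event loop on a daemon thread; callers submit work
    # from the main thread via asyncio.run_coroutine_threadsafe.
    loop = asyncio.new_event_loop()
    threading.Thread(target=loop.run_forever, daemon=True).start()
    return loop

async def scheduler_run():
    await asyncio.sleep(0)  # stand-in for the real scheduler's work
    return "done"

loop = start_background_loop()
future = asyncio.run_coroutine_threadsafe(scheduler_run(), loop)
result = future.result(timeout=5)  # blocks the calling thread until completion
```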

Confidence Score: 4/5

  • Safe to merge — async path is gated behind an env var, sync path is untouched, and the one identified issue is a performance concern (not a data-correctness bug).
  • All previously-identified correctness issues (early-shutdown checkpoint drain, row-drop sync, _admitted_rg_ids unbounded growth, disable_early_shutdown wiring, is_column_complete_for_rg semantics) are addressed in this revision. The remaining issue — on_seeds_complete passed unconditionally — is a performance concern (unnecessary pipeline serialization for pipelines with parallel seed columns and no PRE_BATCH processors), not a data-correctness or data-loss bug. The async path is fully opt-in via DATA_DESIGNER_ASYNC_ENGINE=1, limiting blast radius.
  • packages/data-designer-engine/src/data_designer/engine/dataset_builders/column_wise_builder.py — the on_seeds_complete conditional guard on line 294

Important Files Changed

  • packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py — Core async scheduler updated with error-rate shutdown, per-RG in-flight counters, pre-batch/post-batch callbacks, disable_early_shutdown, and _admitted_rg_ids pruning on checkpoint. The early-shutdown checkpoint fix (lines 141-143), the public is_column_complete_for_rg usage, and _admitted_rg_ids.discard are all correctly wired. The cumulative (non-sliding) error-rate window is intentional per the PR description.
  • packages/data-designer-engine/src/data_designer/engine/dataset_builders/column_wise_builder.py — Main integration point. _build_async correctly builds the generator map (including MultiColumnConfig expansion), partitions rows into groups, and wires all scheduler callbacks. on_before_checkpoint is correctly gated; however, on_seeds_complete is passed unconditionally, which sets has_pre_batch=True in the scheduler for all pipelines — needlessly serializing dispatch of non-seed tasks until all seeds across the row group complete, even when no PRE_BATCH processors exist.
  • packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/completion_tracker.py — Added the public is_column_complete_for_rg method, which correctly checks _batch_complete first (for FULL_COLUMN), then does per-row validation for CELL_BY_CELL columns. Drop-row propagation to _reevaluate_batch_tasks is correct. No issues found.
  • packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/row_group_buffer.py — replace_dataframe correctly marks trailing buffer slots as dropped (line 85) when a processor returns fewer rows; get_dataframe filters dropped rows; checkpoint_row_group frees memory after write. No issues found.
  • packages/data-designer-engine/tests/engine/dataset_builders/test_async_builder_integration.py — Good integration test coverage: multi-column pipeline with seed/cell/full-column generators, checkpoint parquet calls, allow_resize validation guard, and graph topology ordering. The end-to-end async flow is exercised with @pytest.mark.asyncio.

Sequence Diagram

sequenceDiagram
    participant B as ColumnWiseDatasetBuilder
    participant L as BackgroundEventLoop
    participant S as AsyncTaskScheduler
    participant A as _admit_row_groups
    participant T as CompletionTracker
    participant M as RowGroupBufferManager

    B->>B: _validate_async_compatibility()
    B->>B: Build gen_map, ExecutionGraph
    B->>T: CompletionTracker.with_graph(graph, row_groups)
    B->>M: RowGroupBufferManager(artifact_storage)
    B->>S: AsyncTaskScheduler(generators, graph, tracker, row_groups, ...)
    B->>L: _ensure_async_engine_loop()
    B->>L: run_coroutine_threadsafe(scheduler.run())

    activate S
    S->>A: create_task(_admit_row_groups)
    activate A
    loop For each row_group
        A->>S: acquire rg_semaphore
        A->>M: init_row_group(rg_id, rg_size)
        A->>S: _dispatch_seeds(rg_id) → create_task(_execute_seed_task)
        A->>S: wake_event.set()
    end
    deactivate A

    loop Main dispatch loop
        S->>S: _run_seeds_complete_check(seed_cols)
        note over S: Fires on_seeds_complete when all seeds done & no in-flight
        S->>B: on_seeds_complete(rg_id, rg_size) [if PRE_BATCH processors]
        B->>M: get_dataframe(rg_id) → run_pre_batch → replace_dataframe
        B->>T: drop_row for newly-dropped rows
        S->>T: get_ready_tasks(dispatched, admitted_rgs)
        S->>S: create_task(_execute_task) for each ready task
        S->>S: _checkpoint_completed_row_groups(all_columns)
        note over S: Calls on_before_checkpoint if POST_BATCH processors present
        S->>B: on_before_checkpoint(rg_id, rg_size)
        B->>M: get_dataframe → run_post_batch → replace_dataframe
        S->>M: checkpoint_row_group(rg_id) → write parquet
        S->>S: rg_semaphore.release()
        S->>B: on_row_group_complete(rg_id) → on_batch_complete(path)
    end

    opt Salvage rounds (retryable failures)
        S->>S: Re-dispatch deferred tasks
        S->>S: _drain_frontier(seed_cols, has_pre_batch, all_columns)
        S->>S: _checkpoint_completed_row_groups(all_columns)
    end

    S-->>L: scheduler.run() returns
    L-->>B: future.result() returns
    B->>S: collect traces
    B->>M: write_metadata(target_num_records, buffer_size)
    deactivate S
Review comment (packages/data-designer-engine/src/data_designer/engine/dataset_builders/column_wise_builder.py, line 294):
**`on_seeds_complete` always passed, unnecessarily enabling `has_pre_batch` gate**

`on_seeds_complete` is always passed to `AsyncTaskScheduler` regardless of whether any `PRE_BATCH` processors are configured. Inside the scheduler, `has_pre_batch = self._on_seeds_complete is not None` evaluates to `True`, causing ALL non-seed tasks to be gated behind `_pre_batch_done_rgs` for every row group — even when no pre-batch processing is required.

The consequence is non-trivial for pipelines with multiple parallel seed columns: with `has_pre_batch = True`, both `seed_A` and `seed_B` must fully complete before either `downstream_X` or `downstream_Y` can begin dispatch. Without the gate, `downstream_X` would start as soon as `seed_A` finishes (if `seed_B` is not an upstream dependency of `downstream_X`). This prevents the async scheduler from pipelining work that could otherwise run in parallel.

`on_before_checkpoint` was fixed with the identical guard in this same PR (line 295-297). Apply the same treatment to `on_seeds_complete`:

```suggestion
            on_seeds_complete=(on_seeds_complete if self._processor_runner.has_processors_for(ProcessorStage.PRE_BATCH) else None),
```


Last reviewed commit: "fix: skip on_before_..."

@andreatgretel
Contributor Author

(AR) Warning: on_seeds_complete not evaluated in salvage rounds

packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py:187

What: The seeds-complete check (lines 142-156) only runs inside the main while True loop. Salvage rounds retry deferred seed tasks but never check whether seeds completed, so the pre-batch processor is silently skipped for row groups whose seeds succeed in salvage.

Why: If all seed tasks for a row group are rate-limited on first attempt and succeed in salvage, downstream tasks run and the row group checkpoints without the pre-batch processor having run.

Suggestion:

Extract the seeds-complete check into a helper and call it in _drain_frontier or after salvage seed task completion.
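One way the suggested helper extraction could look, with all collaborators reduced to callables so the same check can run in both the main loop and salvage rounds (every name here is illustrative, not the scheduler's real API):

```python
# Sketch of a shared seeds-complete check. Calling this from both the main
# dispatch loop and _drain_frontier ensures pre-batch processing fires even
# when seed tasks only succeed after a salvage retry.
def run_seeds_complete_check(active_rgs, seeds_done, in_flight, pre_batch_done, on_seeds_complete):
    # Iterate a snapshot so callbacks cannot mutate the list mid-iteration.
    for rg_id, rg_size in list(active_rgs):
        if rg_id in pre_batch_done:
            continue  # pre-batch already ran for this row group
        if seeds_done(rg_id) and in_flight(rg_id) == 0:
            on_seeds_complete(rg_id, rg_size)
            pre_batch_done.add(rg_id)
```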

@andreatgretel
Contributor Author

(AR) This PR wires the AsyncTaskScheduler into ColumnWiseDatasetBuilder behind DATA_DESIGNER_ASYNC_ENGINE=1, adding _build_async(), pre/post-batch processor callbacks, error rate shutdown, and task trace surfacing. The sync path is unchanged. All 8 changed files pass ruff clean and 22 tests pass (6 integration + 16 scheduler). build_preview() async path (mentioned in the plan) was not implemented in this PR.

The most significant finding is a critical data-loss path in _checkpoint_completed_row_groups: if on_before_checkpoint raises, the row group's buffer is never freed and its data is silently dropped. Two warning-level findings identify scenarios where non-seed tasks can bypass the pre-batch processor barrier (multi-seed pipelines with partial dependencies, and salvage rounds skipping the seeds-complete check), both producing silently incorrect output. The async/sync code path separation is clean and the error rate shutdown mechanism is well-designed.

Verdict: needs-changes — 1 critical, 7 warnings, 4 suggestions.

- Wire on_batch_complete through on_row_group_complete callback
- Mark trailing slots as dropped in replace_dataframe when processor filters rows
- Ensure checkpoint still runs when on_before_checkpoint raises
- Gate non-seed task dispatch on pre-batch completion
- Add public run_pre_batch_on_df to ProcessorRunner (replaces private _run_stage call)
- Add public is_column_complete_for_rg to CompletionTracker (replaces private _completed access)
- Type task_traces as list[TaskTrace] in results.py
- Add async_trace docstring to RunConfig
- Move module-level log into _build_async
- Add replace_dataframe unit tests (same-size, dropped rows, fewer rows)
- Assert on public outcomes in scheduler tests instead of private attributes
- Parametrize allow_resize validation tests
- Cache seed_cols before main loop
- Remove redundant disable_early_shutdown from AsyncTaskScheduler
- Flush completed row groups before breaking on early shutdown (data loss)
- Change error rate check from >= to > so disable_early_shutdown sentinel
  (1.0) never triggers at 100% failure rate
- Extract seeds-complete check into helper and call it in salvage rounds
  via _drain_frontier, with pre-batch gating, so pre-batch processor runs
  even when seed tasks succeed only after retry
- Fix is_column_complete_for_rg to check _batch_complete first, then verify
  all non-dropped rows for CELL_BY_CELL columns
- Replace O(|in-flight|) scan in _in_flight_for_rg with per-RG counter
@andreatgretel
Contributor Author

Addressed all remaining open issues in 3facfef:

  1. Early shutdown checkpoint skip (Greptile P2) - _checkpoint_completed_row_groups now called before break on early shutdown
  2. disable_early_shutdown at 100% failure (Greptile P2) - changed >= to > in _check_error_rate so the 1.0 sentinel never triggers
  3. on_seeds_complete skipped in salvage rounds (AR warning) - extracted seeds-complete check into _run_seeds_complete_check helper, called in _drain_frontier with pre-batch gating
  4. is_column_complete_for_rg partial cell correctness (Greptile P2) - now checks _batch_complete first, then verifies all non-dropped rows for CELL_BY_CELL
  5. _in_flight_for_rg O(|in-flight|) (Greptile perf) - replaced with per-RG counter dict for O(1) lookup
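The per-RG counter in item 5 might look roughly like this; the increment/decrement sites stand in for the scheduler's dispatch and completion paths, and the max(0, ...) floor is a defensive choice rather than necessarily what the PR implements:

```python
from collections import defaultdict

# Illustrative per-row-group in-flight counter: O(1) lookup instead of
# scanning the full in-flight task set on every seeds-complete check.
class InFlightCounts:
    def __init__(self):
        self._counts = defaultdict(int)

    def task_started(self, rg_id):
        self._counts[rg_id] += 1

    def task_finished(self, rg_id):
        # Floor at zero as pure defense against a double-decrement bug.
        self._counts[rg_id] = max(0, self._counts[rg_id] - 1)

    def in_flight_for_rg(self, rg_id):
        return self._counts[rg_id]
```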

All 63 tests in test_async_scheduler, test_async_builder_integration, test_completion_tracker, and test_row_group_buffer pass.

… safely

Pre-batch processors that filter rows marked them as dropped in
RowGroupBufferManager but not in CompletionTracker, causing unnecessary
LLM calls for rows that would be discarded at checkpoint time.

Also wrap the benchmark warmup stderr redirect in try/finally so stderr
is restored if _run_once raises.
andreatgretel and others added 3 commits March 18, 2026 19:24
Prevents unbounded growth of the admission set across large runs.
Dev-time benchmarks and manual test scripts - kept locally, not needed in the PR.
@andreatgretel
Contributor Author

@nabinchha Re: renaming ColumnWiseDatasetBuilder to DatasetBuilder - keeping it out of this PR to avoid noise on the async integration diff. Tracked in #437 (includes renaming the module file too).

RunConfig.disable_early_shutdown was forwarded to the sync executor
but silently ignored in the async path. Now passed through to the
scheduler's _check_error_rate.
Verifies the async scheduler dispatches independent LLM columns
concurrently by checking for overlapping task trace intervals.
Uses a wide DAG (sampler -> 2 parallel LLM columns) with 2 records.
Requires NVIDIA_API_KEY.
…g unprocessed data

Matches on_seeds_complete failure behavior and avoids silently
checkpointing unfiltered rows when a post-batch processor fails.
@andreatgretel
Contributor Author

Response to Greptile summary (37e3b62):

  • on_before_checkpoint failure writes unprocessed data (P1): Fixed - now drops all rows in the row group instead of checkpointing unprocessed data, matching on_seeds_complete failure behavior.
  • Downstream FULL_COLUMN tasks for fully-dropped row groups (P2): Acknowledged - safe via skip guard, cost is one semaphore slot for one event-loop tick.
  • _run_batch O(rg_size) scan: Acknowledged - fine for current buffer sizes, can add a get_dropped_indices() accessor later if needed.

Avoids unnecessary DataFrame round-trip for every row group in the
common case where no post-batch processors exist.
max_conversation_correction_steps: Maximum number of correction rounds permitted within a
single conversation when generation tasks call `ModelFacade.generate(...)`. Must be >= 0.
Default is 0.
async_trace: If True, collect per-task tracing data when using the async engine.

per-task here may need a little more context?

return False
return True

def is_column_complete_for_rg(self, column: str, row_group: int) -> bool:

@nabinchha nabinchha Mar 18, 2026


super nit: row_group -> row_group_index


@nabinchha nabinchha left a comment


Nice work on this integration, Andre — the callback wiring and the iterative response to prior feedback are both really well done. Here are my findings from a full read-through via claude.

Findings

Critical — Must fix before merge

None — the prior review's critical findings (silent on_batch_complete drop, replace_dataframe stale data, private attribute access, pre-batch barrier bypass, _run_stage cross-boundary call, bare list typing) have all been addressed in subsequent commits. Great job working through those.

Warnings — Strongly recommend fixing

async_scheduler.py:318 — Redundant seed_cols recomputation in _dispatch_seeds

  • What: _dispatch_seeds recomputes seed_cols from the graph's topological order on every call, but run() already computes seed_cols as a frozenset at the top of the method and passes it through _run_seeds_complete_check and _drain_frontier.
  • Why: Minor inefficiency (re-traverses the graph per row group admission), but more importantly a consistency risk — the two computations use different logic (get_topological_order + filter vs. columns + filter). If the graph ever distinguishes between these, they could diverge silently.
  • Suggestion: Pass the already-computed seed_cols frozenset into _dispatch_seeds instead of recomputing it — should be a quick fix.

async_scheduler.py:284-299 — _run_seeds_complete_check iterates _active_rgs while callbacks may mutate it indirectly

  • What: The method iterates over self._active_rgs (a list of tuples). The on_seeds_complete callback can raise, causing drop_row calls. While drop_row doesn't directly modify _active_rgs, if a future change to the drop logic triggers a checkpoint (which does remove from _active_rgs), this would cause a mutation-during-iteration bug.
  • Why: Fragile iteration pattern. The current code is safe today, but the coupling is subtle.
  • Suggestion: Iterate over a snapshot: for rg_id, rg_size in list(self._active_rgs): — cheap insurance.

column_wise_builder.py:314-319 — Telemetry exception silently swallowed with bare except Exception: pass

  • What: The telemetry emission block catches all exceptions and discards them silently. This is a pre-existing pattern from the sync path, but it's now replicated in the new async path.
  • Why: If telemetry breaks (e.g., wrong snapshot format after refactoring), there's zero signal — no log, no metric. A logger.debug would cost nothing and save debugging time.
  • Suggestion: Add minimal logging:
    except Exception:
        logger.debug("Failed to emit batch telemetry for async run", exc_info=True)

async_scheduler.py:260-274 — on_before_checkpoint failure drops all rows including already-generated data

  • What: When on_before_checkpoint raises, the handler drops every row in the row group. The on_before_checkpoint callback runs after all column generation is complete, so this discards fully-generated data. This matches the on_seeds_complete failure pattern (which makes sense there since no downstream work has happened yet), but here the trade-off is different.
  • Why: This is a reasonable design choice — if the post-batch transform is considered mandatory, dropping is correct. Just want to flag it as worth a brief comment explaining the rationale, since it's not immediately obvious why we wouldn't fall back to checkpointing the pre-transform data.
  • Suggestion: A one-line comment above the drop loop would be enough, e.g. # Post-batch is mandatory; drop rather than checkpoint unprocessed data.

column_wise_builder.py:277-280 — on_before_checkpoint passes rg_id as current_batch_number to run_post_batch

  • What: The closure passes current_batch_number=rg_id. In the sync path, current_batch_number is a sequential 0-based batch index. In the async path, rg_id is also 0-based and sequential, so this works today.
  • Why: If row-group IDs ever become non-sequential (e.g., retried row groups get new IDs), this would silently pass incorrect batch numbers to processors that depend on them.
  • Suggestion: Low urgency — either document that rg_id must remain sequential, or introduce a separate batch counter for the async path.

Suggestions — Consider improving

test_async_engine.py:26-27 — Mutating module-level global for test isolation

  • What: The e2e test directly mutates cwb.DATA_DESIGNER_ASYNC_ENGINE = True and restores it in a finally block.
  • Why: Fragile if tests run concurrently in the same process, and won't restore on KeyboardInterrupt.
  • Suggestion: unittest.mock.patch.object(cwb, "DATA_DESIGNER_ASYNC_ENGINE", True) as a context manager would be safer.
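For reference, the suggested patch.object pattern, shown here against a stand-in module object (the real target would be the column_wise_builder module imported as cwb):

```python
import types
import unittest.mock as mock

# Stand-in for the cwb module; only the attribute name comes from the test.
cwb = types.SimpleNamespace(DATA_DESIGNER_ASYNC_ENGINE=False)

# patch.object restores the flag on exit, even if the body raises or the
# test is interrupted, unlike a manual set/finally dance.
with mock.patch.object(cwb, "DATA_DESIGNER_ASYNC_ENGINE", True):
    assert cwb.DATA_DESIGNER_ASYNC_ENGINE is True
assert cwb.DATA_DESIGNER_ASYNC_ENGINE is False  # restored on exit
```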

async_scheduler.py:91-92 — _in_flight_counts can theoretically go negative

  • What: The counter is decremented in the finally block of _execute_task_inner. If a task were somehow processed twice, the counter could go below zero.
  • Why: A negative count would cause _in_flight_for_rg to return False prematurely, potentially triggering _run_seeds_complete_check too early.
  • Suggestion: Add a floor: max(0, self._in_flight_counts.get(task.row_group, 0) - 1). Purely defensive.

column_wise_builder.py:68 — Importing private function _ensure_async_engine_loop

  • What: The builder imports _ensure_async_engine_loop (underscore-prefixed) from async_concurrency.
  • Why: Crosses a module boundary using a private function.
  • Suggestion: Consider renaming to ensure_async_engine_loop (public) or re-exporting from the module's __init__.

test_async_builder_integration.py:108 — Import inside async test function

  • What: test_build_async_end_to_end and test_checkpoint_produces_correct_parquet_calls import AsyncTaskScheduler inside the test body.
  • Why: AGENTS.md prefers module-level imports.
  • Suggestion: Move to module level or use pytest.importorskip. Minor nit.

What Looks Good

  • Thorough callback wiring: The pre-batch and post-batch processor callbacks are cleanly integrated with proper error handling and row-drop synchronization between RowGroupBufferManager and CompletionTracker. The on_seeds_complete / on_before_checkpoint separation is a clean design.

  • Responsive to review feedback: All prior review comments have been addressed in follow-up commits — on_batch_complete wired through, replace_dataframe drop handling, public is_column_complete_for_rg and run_pre_batch_on_df methods, pre-batch barrier gating, _admitted_rg_ids pruning, disable_early_shutdown forwarding. The commit history tells a clear story of iterative improvement.

  • Well-structured tests: The new test files cover the key integration scenarios (multi-column pipeline, checkpoint correctness, allow_resize guard, pre-batch failure, error rate shutdown) with clean setup and meaningful assertions on observable outcomes. The e2e concurrency test using trace interval overlap is a clever verification approach.

Verdict

Ship it (with nits) — The core integration is solid, all prior critical feedback has been addressed, the sync path is untouched, and test coverage is good. The warnings above are mostly about defensive coding and documentation — none are blocking. Nice work getting this across the finish line!


Development

Successfully merging this pull request may close these issues.

feat(engine): async generators and task-queue dataset builder
