Skip to content

Conversation

@Nachiket-Roy
Copy link
Contributor

@Nachiket-Roy Nachiket-Roy commented Jan 11, 2026

Which issue does this PR close?

Related to #19679
Related to #19695

Rationale for this change

It was observed that downstream operators (such as SortExec) can receive extremely large RecordBatch outputs from AggregateExec, violating the implicit batch_size convention used throughout the physical plan.
An initial hypothesis was that normalizing batch sizes at the AggregateExec -> SortExec boundary would be sufficient.

However, while implementing and testing this, it became clear that:

  • GroupedHashAggregateStream allocates group keys and accumulator state before memory reservation checks are enforced
  • By the time emission, spilling, or streaming merge logic runs, memory pressure may already be unavoidable
  • Even when output batches are sliced to batch_size, internal state can exceed memory limits
  • As a result, downstream operators still observe oversized memory usage

This indicates that the issue is not a simple emission bug, but a design boundary between:

  • aggregation state growth
  • memory reservation timing
  • batch_size as an output convention vs. a hard invariant

This PR aims to:

  • reduce memory pressure in specific transitions
  • make failure modes more predictable
  • provide a basis for further discussion

What changes are included in this PR?

  • Releases hash-aggregation memory before switching to streaming merge
  • Ensures memory reservations are explicitly reset when transitioning execution phases
  • Avoids retaining aggregation state after it is no longer needed
  • Tightens execution-state transitions to prevent double emission and late memory checks
  • Improves internal invariants around spill and merge behavior

Importantly, this PR does not claim to fully enforce batch_size at all stages, nor does it prevent all oversized intermediate states. Instead, it reduces memory pressure and clarifies current limitations.

Are these changes tested?

Yes, but edge cases revealed design limitations rather than simple bugs
The PR is submitted primarily for review and discussion rather than as a final fix.

Are there any user-facing changes?

No.

@github-actions github-actions bot added the physical-plan Changes to the physical-plan crate label Jan 11, 2026
@Nachiket-Roy
Copy link
Contributor Author

Nachiket-Roy commented Jan 11, 2026

cc @2010YOUY01 @EmilyMatt Please go through the PR description Thanks!

@Omega359
Copy link
Contributor

Can you take a look at the failing tests please?

@Nachiket-Roy
Copy link
Contributor Author

Can you take a look at the failing tests please?

The failing tests point to an underlying issue in this approach that doesn’t seem fixable with a small change alone and likely needs an architectural decision. I’ve added more context in the PR description.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants