join: count effort on inputs to the `results` closure #389

teskje · 2023-04-25T10:49:59Z

Previously, the join operator's effort counting was based on the outputs of the passed `results` closure. As a result, fueling became ineffective whenever the `results` closure always returned empty iterators. In such scenarios, the join operator would only yield once it had exhausted its inputs, negatively impacting concurrent operators and possibly application interactivity.

Having the `results` closure return empty iterators is useful when the caller no longer cares about the results of a join (e.g. when the dataflow is shutting down). By returning nothing from `results`, the updates queued up before the join can be drained quickly, without feeding additional updates to downstream operators.
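As a rough illustration of the shutdown pattern described above, here is a minimal sketch of a closure that stops emitting once a flag is set. The names `make_results` and `shutdown`, the record types, and the use of `Vec` as the returned iterator are assumptions made for this example; they are not taken from differential dataflow or Materialize.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical stand-in for a `results` closure: given a key and one value
/// from each join input, it produces the output records for that match.
/// Once `shutdown` is set, it produces nothing, so queued updates are
/// consumed without sending anything downstream.
fn make_results(
    shutdown: Rc<Cell<bool>>,
) -> impl FnMut(&u64, &String, &String) -> Vec<(u64, String)> {
    move |key, left, right| {
        if shutdown.get() {
            // The caller no longer wants results: emit nothing.
            Vec::new()
        } else {
            // Normal operation: emit one joined record.
            vec![(*key, format!("{}{}", left, right))]
        }
    }
}
```

A caller would keep its own clone of the `Rc<Cell<bool>>` and set it to `true` when the dataflow is cancelled.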

This commit attempts to improve the situation by changing the way the join operator counts its effort. Instead of counting the number of outputs of `results`, it counts the number of inputs to `results`. That way, even if `results` decides to stop emitting updates, join fueling continues to work as expected.

The new behavior is consistent with the half join operator from `dogsdogsdogs`, which also uses the input, rather than the output, of the `output_func` for its work counting.
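The difference between the two counting schemes can be sketched with a toy model of a single fueled activation. This is not the join operator's actual code: `Count`, `work`, and the `(u64, u64)` input pairs are hypothetical names and types chosen for illustration only.

```rust
/// Which quantity is charged against the fuel budget.
#[derive(Clone, Copy)]
enum Count {
    /// Old behavior: charge one unit per record returned by `results`.
    Outputs,
    /// New behavior: charge one unit per invocation of `results`.
    Inputs,
}

/// Toy model of one activation: process `pending` input pairs until the
/// accumulated effort reaches `fuel`, then yield.
/// Returns how many input pairs were consumed before yielding.
fn work<I>(
    pending: &[(u64, u64)],
    fuel: usize,
    mode: Count,
    mut results: impl FnMut(u64, u64) -> I,
) -> usize
where
    I: IntoIterator<Item = u64>,
{
    let mut effort = 0;
    let mut consumed = 0;
    for &(a, b) in pending {
        if effort >= fuel {
            // Out of fuel: yield so other operators can run.
            break;
        }
        let produced = results(a, b).into_iter().count();
        effort += match mode {
            Count::Outputs => produced,
            Count::Inputs => 1,
        };
        consumed += 1;
    }
    consumed
}
```

In this model, a closure that always returns an empty iterator never accumulates effort under `Count::Outputs`, so the loop runs through all pending inputs before yielding. Under `Count::Inputs` it stops after `fuel` invocations regardless of what the closure returns.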

Motivation

The specific motivation for this PR is MaterializeInc/materialize#18927, in which we attempt to speed up shutdown of join dataflows by having the join closure stop emitting updates once dataflow cancellation has been observed. This strategy works nicely with delta joins but degrades interactivity with linear joins due to the different effort counting behavior.

Performance

I added performance measurements for this change to the "Linear Join" section of MaterializeInc/materialize#18927 (comment). In summary, we should expect a slowdown of ~1% due to incrementing effort every time we invoke the join closure, rather than only once at the end.

I was wondering if we could instead estimate the effort upfront using:

effort += thinker.history1.edits.len() * thinker.history2.edits.len();

That would probably mitigate the performance impact, but might also be incorrect. It looks like the thinker spends considerable effort to avoid a quadratic output, so the product of the two lengths may overstate the actual work.

Relevant Slack discussion.


teskje · 2023-05-23T13:48:28Z

Closing in favor of #390.

teskje mentioned this pull request Apr 25, 2023: compute: tokenize delta join operator MaterializeInc/materialize#18927 (Merged, 5 tasks)

teskje marked this pull request as ready for review April 25, 2023 11:09

teskje mentioned this pull request Apr 26, 2023: Add a join_core_yielding operator #390 (Open)

teskje closed this May 30, 2023
