join: count effort on inputs to the results closure
#389
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Previously, the join operator's effort counting was based on the outputs of the passed
resultsclosure. This had the unfortunate effect that fueling would become ineffective in cases where theresultsclosure always returns empty iterators. In such scenarios, the join operator would only yield once it has exhausted its inputs, negatively impacting concurrent operators and possibly application interactivity.Having the
resultsclosure return empty iterators is useful when the caller does not care about the results of a join anymore (e.g. when the dataflow is shutting down). By returning nothing fromresults, the updates queued up before the join can be drained quickly, without feeding additional updates to downstream operators.This commit attempts to improve the situation by changing the way the join operator counts its effort. Instead of counting the number of
resultsoutputs, we can count the number of inputs. That way, even ifresultsdecides to stop emitting updates, join fueling continues to work as expected.The new behavior is consistent with the half join operator from
dogsdogsdogs, which also uses the input, rather than the output, of theoutput_funcfor its work counting.Motivation
The specific motivation for this PR is MaterializeInc/materialize#18927, in which we attempt to speed up shutdown of join dataflows by stopping emission of updates from the join closure once the dataflow cancellation was observed. This strategy works nicely with delta joins but degrades interactivity with linear joins due to the different effort counting behavior.
Performance
I added performance measurements for this change to the "Linear Join" section of MaterializeInc/materialize#18927 (comment). In summary, we should expect ~1% of slowdown due to incrementing
effortevery time we invoke the join closure, rather than only once at the end.I was wondering if we could instead estimate the effort upfront using:
That would probably mitigate the performance impact, but might also be incorrect. It looks like the thinker spends quite some effort to avoid a quadratic output.
Relevant Slack discussion.