-
Notifications
You must be signed in to change notification settings - Fork 999
Add filter_join_indices #20385
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add filter_join_indices #20385
Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
shrshi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Partial review:
revans2
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my perspective this looks great. It just does not cover all of my use cases and if I need to file follow on issues to get those use cases covered I am happy to do so.
| */ | ||
| std::pair<std::unique_ptr<rmm::device_uvector<size_type>>, | ||
| std::unique_ptr<rmm::device_uvector<size_type>>> | ||
| filter_join_indices(cudf::table_view const& left, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry about the long response. I want to make sure that you understand the use cases I really want to cover and why this API works okay but does not really cover all of the use cases I have. I am fine if we do other work as a follow on at some point, but I want my very generic use cases understood.
My main concern is that we want to use the inner + post processing pattern for joins that do not include an AST component. By having the AST and the post processing combined together into a single API call I don't know how to not take a hit to turn an inner join into a left outer join. That said from reading the code it is clear why you want them combined from a performance standpoint. This code is much better than what I am currently using.
I am just hopeful that we can get another API to turn an inner join result into the other join types. I can file a follow on issue for that.
The second issue I have, and again I can file a follow on issue for this is that for Spark Rapids we have to be able to support a join that is larger than a single table on each side.
We get a lot of flexibility from having the inner join be the basis for all of this. By using an inner join with AST post processing, in the worst case when both sides of the join are so large neither fits totally in memory, or at least not as a single table, then we can do cross join like semantics L1 joined with R1, L1 joined with R2, L2 joined with R1, and L2 joined with R2. That works great for an inner join because we can just output each batch when we are done. It does not work for any other join type unless we can track externally what happened. For example with a left outer join. I would output the results of each inner join as they happened, but I would also need to keep track of a boolean column for which rows on the left actually matched something. So for the example above I would need something like matched(L1 X R1) OR matched(L1 X R2) for the L1 table. Then when I am done with joining L1 with all of the tables in R* I would filter the L1 columns to keep only what didn't match anything in R1 or R2, add in null columns for the R schema and output that. For FULL outer we would track matches for both sides, for SEMI/ANTI we would not output any result from the inner join and just output the filtered columns at the end.
To be clear this API does not stop us from doing this. It just means we have to write our own kernels to take the gather map and turn it into this boolean column (which I have done on a hacked up branch, but my cuda skills are not great), and for that use case we would end up likely always calling this API with the join_kind being inner, even if it is not, just to skip over the post processing for the other join types.
To be clear we are not likely to do a cross join like processing very often. Generally we have a goal of getting one side of the join to fit in a table, and then streaming through the other side of the join. But in those cases we still end up limited by the build side and the stream side, which can be a problem and converting the gather maps to a boolean column makes it possible to select the build and stream side for any join I want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed clarification. I now have a better overall picture of what needs to be done to fully support these Spark use cases. I opened this PR primarily to address part of the comment from #20165 (comment) about using a normal hash/sort join followed by AST-based post filtering to emulate mixed join behavior:
Update: further discussion today covered the idea that a "mixed" join use a hash probe to create a gather map and then assess predicate for each element in the gather map. Generating the gather map doesn't take too much memory space, and even for exploding joins we can use a chunked probe, so running AST operators with a gather map might end up being a more efficient "mixed" join strategy.
My understanding is that this could improve mixed join performance and potentially help with #20301, where a sort-based left join combined with filter_join_indices could eliminate the mixed join hang.
Since libcudf doesn’t yet support sort-based left join, I’m wondering: would it be better to add left-join support to sort-merge join to properly close #20301, or should we rely on sort-based inner joins plus post-processing to implement left joins? As you noted, all join types can be derived from an inner join with the right post-processing.
On that note, regarding the post-processing mentioned in #20165, @revans2 could you please create a sub-issue under #20165 outlining which join types you want the post-processing API to handle (e.g., left, full, left-semi, left-anti)? And it would also be helpful to add another sub-issue describing the expected approach for generating a per-row match mask based on the join kind.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Co-authored-by: Shruti Shivakumar <shruti.shivakumar@gmail.com>
Co-authored-by: Nghia Truong <7416935+ttnghia@users.noreply.github.com>
|
/merge |
Description
Depends on #20703
Contributes to #20165 and #20301
This PR introduces a new API,
filter_join_indices, which can be used in combination with standard hash or sort joins. For example, amixed_inner_joincan be replaced withhash_join::inner_joinfollowed byfilter_join_indicesto achieve equivalent behavior. This results in a significant performance improvement over the current mixed join implementation.Checklist