Skip to content

Conversation

@PointKernel
Copy link
Member

@PointKernel PointKernel commented Oct 28, 2025

Description

Depends on #20703

Contributes to #20165 and #20301

This PR introduces a new API, filter_join_indices, which can be used in combination with standard hash or sort joins. For example, a mixed_inner_join can be replaced with hash_join::inner_join followed by filter_join_indices to achieve equivalent behavior. This results in a significant performance improvement over the current mixed join implementation.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@copy-pr-bot
Copy link

copy-pr-bot bot commented Oct 28, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue labels Oct 28, 2025
@GregoryKimball GregoryKimball requested a review from shrshi October 28, 2025 21:41
@GregoryKimball GregoryKimball moved this to Burndown in libcudf Oct 28, 2025
@PointKernel PointKernel added feature request New feature or request Spark Functionality that helps Spark RAPIDS non-breaking Non-breaking change 2 - In Progress Currently a work in progress labels Oct 30, 2025
@karthikeyann karthikeyann moved this from Burndown to Slip in libcudf Nov 12, 2025
Copy link
Contributor

@shrshi shrshi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial review:

Copy link
Contributor

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my perspective this looks great. It just does not cover all of my use cases and if I need to file follow on issues to get those use cases covered I am happy to do so.

*/
std::pair<std::unique_ptr<rmm::device_uvector<size_type>>,
std::unique_ptr<rmm::device_uvector<size_type>>>
filter_join_indices(cudf::table_view const& left,
Copy link
Contributor

@revans2 revans2 Dec 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about the long response. I want to make sure that you understand the use cases I really want to cover and why this API works okay but does not really cover all of the use cases I have. I am fine if we do other work as a follow on at some point, but I want my very generic use cases understood.

My main concern is that we want to use the inner + post processing pattern for joins that do not include an AST component. By having the AST and the post processing combined together into a single API call I don't know how to not take a hit to turn an inner join into a left outer join. That said from reading the code it is clear why you want them combined from a performance standpoint. This code is much better than what I am currently using.

I am just hopeful that we can get another API to turn an inner join result into the other join types. I can file a follow on issue for that.

The second issue I have, and again I can file a follow on issue for this is that for Spark Rapids we have to be able to support a join that is larger than a single table on each side.
We get a lot of flexibility from having the inner join be the basis for all of this. By using an inner join with AST post processing, in the worst case when both sides of the join are so large neither fits totally in memory, or at least not as a single table, then we can do cross join like semantics L1 joined with R1, L1 joined with R2, L2 joined with R1, and L2 joined with R2. That works great for an inner join because we can just output each batch when we are done. It does not work for any other join type unless we can track externally what happened. For example with a left outer join. I would output the results of each inner join as they happened, but I would also need to keep track of a boolean column for which rows on the left actually matched something. So for the example above I would need something like matched(L1 X R1) OR matched(L1 X R2) for the L1 table. Then when I am done with joining L1 with all of the tables in R* I would filter the L1 columns to keep only what didn't match anything in R1 or R2, add in null columns for the R schema and output that. For FULL outer we would track matches for both sides, for SEMI/ANTI we would not output any result from the inner join and just output the filtered columns at the end.

To be clear this API does not stop us from doing this. It just means we have to write our own kernels to take the gather map and turn it into this boolean column (which I have done on a hacked up branch, but my cuda skills are not great), and for that use case we would end up likely always calling this API with the join_kind being inner, even if it is not, just to skip over the post processing for the other join types.

To be clear we are not likely to do a cross join like processing very often. Generally we have a goal of getting one side of the join to fit in a table, and then streaming through the other side of the join. But in those cases we still end up limited by the build side and the stream side, which can be a problem and converting the gather maps to a boolean column makes it possible to select the build and stream side for any join I want.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed clarification. I now have a better overall picture of what needs to be done to fully support these Spark use cases. I opened this PR primarily to address part of the comment from #20165 (comment) about using a normal hash/sort join followed by AST-based post filtering to emulate mixed join behavior:

Update: further discussion today covered the idea that a "mixed" join use a hash probe to create a gather map and then assess predicate for each element in the gather map. Generating the gather map doesn't take too much memory space, and even for exploding joins we can use a chunked probe, so running AST operators with a gather map might end up being a more efficient "mixed" join strategy.

My understanding is that this could improve mixed join performance and potentially help with #20301, where a sort-based left join combined with filter_join_indices could eliminate the mixed join hang.

Since libcudf doesn’t yet support sort-based left join, I’m wondering: would it be better to add left-join support to sort-merge join to properly close #20301, or should we rely on sort-based inner joins plus post-processing to implement left joins? As you noted, all join types can be derived from an inner join with the right post-processing.

On that note, regarding the post-processing mentioned in #20165, @revans2 could you please create a sub-issue under #20165 outlining which join types you want the post-processing API to handle (e.g., left, full, left-semi, left-anti)? And it would also be helpful to add another sub-issue describing the expected approach for generating a per-row match mask based on the join kind.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have opened a PR to support left joins with the sort-merge algorithm #20787
The performance bug in #20301 is indeed eliminated with a sort-based join + post filter approach - the test completes in ~1.1s (#20802)

@PointKernel PointKernel requested a review from shrshi December 3, 2025 22:14
Co-authored-by: Shruti Shivakumar <shruti.shivakumar@gmail.com>
@PointKernel PointKernel requested a review from ttnghia December 8, 2025 21:19
@PointKernel
Copy link
Member Author

/merge

@rapids-bot rapids-bot bot merged commit e27d58e into rapidsai:main Dec 8, 2025
139 checks passed
@PointKernel PointKernel deleted the filter-gather-map branch December 8, 2025 21:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

3 - Ready for Review Ready for review by team CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS

Projects

Status: In progress

Development

Successfully merging this pull request may close these issues.

4 participants