Add filter_join_indices #20385

PointKernel · 2025-10-28T00:49:58Z

Description

Depends on #20703

Contributes to #20165 and #20301

This PR introduces a new API, filter_join_indices, which can be used in combination with standard hash or sort joins. For example, a mixed_inner_join can be replaced with hash_join::inner_join followed by filter_join_indices to achieve equivalent behavior. This results in a significant performance improvement over the current mixed join implementation.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2025-10-28T00:50:01Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

shrshi

Partial review:

cpp/include/cudf/join/join.hpp

cpp/src/join/filter_join_indices.cu

cpp/src/join/filter_join_indices_kernel.cuh

cpp/src/join/filter_join_indices.cu

revans2

From my perspective this looks great. It just does not cover all of my use cases and if I need to file follow on issues to get those use cases covered I am happy to do so.

revans2 · 2025-12-02T15:43:24Z

cpp/include/cudf/join/join.hpp

+ */
+std::pair<std::unique_ptr<rmm::device_uvector<size_type>>,
+          std::unique_ptr<rmm::device_uvector<size_type>>>
+filter_join_indices(cudf::table_view const& left,


Sorry about the long response. I want to make sure that you understand the use cases I really want to cover and why this API works okay but does not really cover all of the use cases I have. I am fine if we do other work as a follow on at some point, but I want my very generic use cases understood.

My main concern is that we want to use the inner + post processing pattern for joins that do not include an AST component. By having the AST and the post processing combined together into a single API call I don't know how to not take a hit to turn an inner join into a left outer join. That said from reading the code it is clear why you want them combined from a performance standpoint. This code is much better than what I am currently using.

I am just hopeful that we can get another API to turn an inner join result into the other join types. I can file a follow on issue for that.

The second issue I have, and again I can file a follow on issue for this is that for Spark Rapids we have to be able to support a join that is larger than a single table on each side.
We get a lot of flexibility from having the inner join be the basis for all of this. By using an inner join with AST post processing, in the worst case when both sides of the join are so large neither fits totally in memory, or at least not as a single table, then we can do cross join like semantics L1 joined with R1, L1 joined with R2, L2 joined with R1, and L2 joined with R2. That works great for an inner join because we can just output each batch when we are done. It does not work for any other join type unless we can track externally what happened. For example with a left outer join. I would output the results of each inner join as they happened, but I would also need to keep track of a boolean column for which rows on the left actually matched something. So for the example above I would need something like matched(L1 X R1) OR matched(L1 X R2) for the L1 table. Then when I am done with joining L1 with all of the tables in R* I would filter the L1 columns to keep only what didn't match anything in R1 or R2, add in null columns for the R schema and output that. For FULL outer we would track matches for both sides, for SEMI/ANTI we would not output any result from the inner join and just output the filtered columns at the end.

To be clear this API does not stop us from doing this. It just means we have to write our own kernels to take the gather map and turn it into this boolean column (which I have done on a hacked up branch, but my cuda skills are not great), and for that use case we would end up likely always calling this API with the join_kind being inner, even if it is not, just to skip over the post processing for the other join types.

To be clear we are not likely to do a cross join like processing very often. Generally we have a goal of getting one side of the join to fit in a table, and then streaming through the other side of the join. But in those cases we still end up limited by the build side and the stream side, which can be a problem and converting the gather maps to a boolean column makes it possible to select the build and stream side for any join I want.

Thanks for the detailed clarification. I now have a better overall picture of what needs to be done to fully support these Spark use cases. I opened this PR primarily to address part of the comment from #20165 (comment) about using a normal hash/sort join followed by AST-based post filtering to emulate mixed join behavior:

Update: further discussion today covered the idea that a "mixed" join use a hash probe to create a gather map and then assess predicate for each element in the gather map. Generating the gather map doesn't take too much memory space, and even for exploding joins we can use a chunked probe, so running AST operators with a gather map might end up being a more efficient "mixed" join strategy.

My understanding is that this could improve mixed join performance and potentially help with #20301, where a sort-based left join combined with filter_join_indices could eliminate the mixed join hang.

Since libcudf doesn’t yet support sort-based left join, I’m wondering: would it be better to add left-join support to sort-merge join to properly close #20301, or should we rely on sort-based inner joins plus post-processing to implement left joins? As you noted, all join types can be derived from an inner join with the right post-processing.

On that note, regarding the post-processing mentioned in #20165, @revans2 could you please create a sub-issue under #20165 outlining which join types you want the post-processing API to handle (e.g., left, full, left-semi, left-anti)? And it would also be helpful to add another sub-issue describing the expected approach for generating a per-row match mask based on the join kind.

I have opened a PR to support left joins with the sort-merge algorithm #20787
The performance bug in #20301 is indeed eliminated with a sort-based join + post filter approach - the test completes in ~1.1s (#20802)

cpp/src/join/filter_join_indices.cu

cpp/benchmarks/CMakeLists.txt

Co-authored-by: Shruti Shivakumar <shruti.shivakumar@gmail.com>

cpp/benchmarks/join/join_common.hpp

cpp/include/cudf/join/join.hpp

cpp/src/join/filter_join_indices.cu

cpp/src/join/filter_join_indices_kernel.cuh

cpp/src/join/filter_join_indices_kernel.hpp

Co-authored-by: Nghia Truong <7416935+ttnghia@users.noreply.github.com>

PointKernel · 2025-12-08T21:24:19Z

/merge

PointKernel added 4 commits October 23, 2025 17:04

Get rid of hashing helper header

7d83b94

Merge remote-tracking branch 'upstream/main'

f3da9ac

Add filter_gather_map

68b83ad

Remove redundant file

2eee15d

github-actions bot assigned PointKernel Oct 28, 2025

github-actions bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue labels Oct 28, 2025

PointKernel added 2 commits October 27, 2025 19:24

Updates

33de98a

Merge remote-tracking branch 'upstream/main' into filter-gather-map

904b743

GregoryKimball requested a review from shrshi October 28, 2025 21:41

GregoryKimball added this to libcudf Oct 28, 2025

GregoryKimball moved this to Burndown in libcudf Oct 28, 2025

Merge remote-tracking branch 'upstream/main' into filter-gather-map

52a77b6

PointKernel added feature request New feature or request Spark Functionality that helps Spark RAPIDS non-breaking Non-breaking change 2 - In Progress Currently a work in progress labels Oct 30, 2025

PointKernel added 9 commits October 30, 2025 10:40

Merge remote-tracking branch 'upstream/main' into filter-gather-map

7b77c83

Clean up tests

c700679

More cleanups

2cc63d1

Update throw

bbbc2db

Updates

401d8ce

Fix all tests

22d5f3d

Updates

bfd0e2c

Merge remote-tracking branch 'upstream/main' into filter-gather-map

2831810

cleanups

afb867d

karthikeyann moved this from Burndown to Slip in libcudf Nov 12, 2025

PointKernel added 2 commits November 17, 2025 09:58

Merge remote-tracking branch 'upstream/main' into filter-gather-map

a5899fd

Fix missing include

5386571

shrshi reviewed Dec 2, 2025

View reviewed changes

revans2 approved these changes Dec 2, 2025

View reviewed changes

PointKernel added 6 commits December 3, 2025 12:49

Merge remote-tracking branch 'upstream/main' into filter-gather-map

d6dbb99

Update docs

e8b7cff

Simplify inner join validity check

feb7d18

Cleanups + add benchmarks

03ebab2

Minor fix

e4e8193

Merge remote-tracking branch 'upstream/main' into filter-gather-map

44b5c97

PointKernel requested a review from shrshi December 3, 2025 22:14

cleanups

6c7ab00

shrshi approved these changes Dec 3, 2025

View reviewed changes

cpp/benchmarks/CMakeLists.txt Outdated Show resolved Hide resolved

Update cpp/benchmarks/CMakeLists.txt

773d13b

Co-authored-by: Shruti Shivakumar <shruti.shivakumar@gmail.com>