Conversation

revans2 (Contributor) commented Dec 31, 2025

Description

Update sort merge join so it can produce more than INT32_MAX output rows.

Hash-based joins can already do this, but sort merge join would get an illegal memory access error (or might end up corrupting data).

I didn't add new tests because the existing benchmarks already cover it:

```
JOIN_NVBENCH --rmm_mode async --benchmark 'high_multiplicity_inner_join' --axis 'Algorithm=SORT_MERGE'
```

When run on a 48 GiB RTX A6000, the benchmark would crash with:

```
Run:  [207/1260] high_multiplicity_inner_join [Device=0 Nullable=0 NullEquality=NULLS_EQUAL DataType=INT32 Algorithm=SORT_MERGE num_keys=2 left_size=10000000 right_size=100000 multiplicity=1000]
/home/roberte/src/spark-rapids-jni/target/jni/cmake-build/_deps/nvbench-src/nvbench/detail/timestamps_kernel.cu:62: Cuda API call returned error: cudaErrorIllegalAddress: an illegal memory access was encountered
Command: 'cudaHostUnregister(&m_host_timestamps)'
```
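For scale: with left_size=10,000,000 and multiplicity=1000, an inner join can produce on the order of 10^10 matched rows, well past INT32_MAX (2,147,483,647), which would explain why 32-bit row indexing overflows in this configuration.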

After this change, all of the benchmarks complete. I spot-checked benchmarks run with and without the fix to see how much overhead the fix adds, and it appears to be < 1% of the runtime.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: Robert (Bobby) Evans <bobby@apache.org>
revans2 requested a review from a team as a code owner December 31, 2025 19:00
revans2 requested reviews from PointKernel and vyasr December 31, 2025 19:00
revans2 self-assigned this Dec 31, 2025
github-actions bot added the libcudf (Affects libcudf (C++/CUDA) code) label Dec 31, 2025
revans2 added the bug, 3 - Ready for Review, non-breaking, and Reliability labels Dec 31, 2025
revans2 (Contributor, Author) commented Dec 31, 2025

I do want to call out that I am not a CUDA expert at all, and most of this was done with the help of AI, so please let me know if I did anything wrong.

revans2 (Contributor, Author) commented Dec 31, 2025

The failed pandas test appears to be totally unrelated to my PR:

```
FAILED tests/groupby/test_groupby.py::test_obj_arg_get_group_deprecated - TypeError: Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construct a NumPy array explicitly.
= 1 failed, 200729 passed, 9140 skipped, 8715 xfailed, 85 xpassed, 18008 warnings in 1907.04s (0:31:47) =
```

Once I get some review feedback, I will upmerge and try the tests again.

revans2 (Contributor, Author) commented Jan 6, 2026

@ttnghia and @PointKernel, could you take another look? I updated the code to avoid issues with 32-bit indexes (or I should say AI did it for me). I ran the performance benchmarks and the performance difference between the two was minimal. The new code appears to use more memory than the old code did, so I hit a few cases where a test failed with an out-of-memory allocation error, which didn't happen with the Thrust 64-bit code. But in either case we are no longer touching memory we shouldn't, so that is a win. I would rather have an exception I can catch and deal with than potential data corruption.

PointKernel requested a review from shrshi January 6, 2026 18:16
```
larger_indices.begin(),
thrust::maximum<size_type>{});

// Scatter in chunks to handle large arrays (> INT32_MAX)
```
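For context, the comment in the diff above marks the chunked-scatter path. A minimal sketch of that pattern, with hypothetical names (scatter_in_chunks, d_map) that are not the PR's actual code, assuming the map's index type is wide enough to address the full output:

```cpp
#include <thrust/scatter.h>
#include <thrust/execution_policy.h>
#include <algorithm>
#include <cstddef>

// Hypothetical helper: perform a large scatter as a sequence of sub-INT32_MAX
// chunks so each individual thrust::scatter call sees a 32-bit-safe count.
template <typename T, typename IndexT>
void scatter_in_chunks(T const* d_values, IndexT const* d_map, std::size_t n, T* d_out)
{
  constexpr std::size_t chunk_size = std::size_t{1} << 30;  // comfortably below INT32_MAX
  for (std::size_t begin = 0; begin < n; begin += chunk_size) {
    std::size_t len = std::min(chunk_size, n - begin);
    // The map entries index into the full output buffer, so d_out is not offset.
    thrust::scatter(thrust::device,
                    d_values + begin,
                    d_values + begin + len,
                    d_map + begin,
                    d_out);
  }
}
```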
Contributor:
Can we replace the scatter + max scan with a single cub::DeviceCopy::Batched? It supports 64-bit operations (NVIDIA/cccl#50 (comment)). In fact, I think we can directly use the run-length-decoding example (https://nvidia.github.io/cccl/cub/api/structcub_1_1DeviceCopy.html#snippet) that CCCL provides.

Contributor:
Sorry, I don't understand the logic here. How can scatter + max scan be replaced by copying? Can you provide pseudocode?

Contributor:
@ttnghia I've implemented the idea here: https://github.com/shrshi/cudf/blob/30374a137c51e9a7e3576a5bbd3b65d2aa9d1fca/cpp/src/join/sort_merge_join.cu#L210

Since cub::DeviceCopy::Batched copies from multiple source ranges, if we provide a constant iterator for each of the source ranges, we achieve the same behavior as a scatter followed by a max scan. That is, if we give the copying API a constant iterator pointing to each of the nonzero match indices, along with the correct output positions in the larger_indices array, we get the desired result.
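To make this concrete, here is a minimal sketch adapted from the CCCL run-length-decoding snippet linked above. The names (run_length_decode, d_run_values, d_run_offsets) are illustrative rather than the PR's actual variables, and the offsets are 64-bit since the total output can exceed INT32_MAX:

```cpp
#include <cub/device/device_copy.cuh>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <cuda_runtime.h>
#include <cstdint>

// For run i, the "source" is a constant iterator that repeats the run's value.
struct value_iter_for_run {
  int32_t const* d_run_values;
  __host__ __device__ auto operator()(uint32_t i) const
  {
    return thrust::make_constant_iterator(d_run_values[i]);
  }
};

// For run i, the destination is the output buffer at that run's starting offset.
struct out_ptr_for_run {
  int32_t* d_out;
  int64_t const* d_run_offsets;
  __host__ __device__ int32_t* operator()(uint32_t i) const { return d_out + d_run_offsets[i]; }
};

// For run i, the copy size is the distance between consecutive offsets.
struct size_for_run {
  int64_t const* d_run_offsets;
  __host__ __device__ int64_t operator()(uint32_t i) const
  {
    return d_run_offsets[i + 1] - d_run_offsets[i];
  }
};

void run_length_decode(int32_t const* d_run_values,
                       int64_t const* d_run_offsets,  // num_runs + 1 entries
                       uint32_t num_runs,
                       int32_t* d_out,
                       cudaStream_t stream)
{
  thrust::counting_iterator<uint32_t> iota{0};
  auto in_iters  = thrust::make_transform_iterator(iota, value_iter_for_run{d_run_values});
  auto out_iters = thrust::make_transform_iterator(iota, out_ptr_for_run{d_out, d_run_offsets});
  auto sizes     = thrust::make_transform_iterator(iota, size_for_run{d_run_offsets});

  // Standard CUB two-phase pattern: first query temp storage, then run.
  void* d_temp      = nullptr;
  size_t temp_bytes = 0;
  cub::DeviceCopy::Batched(d_temp, temp_bytes, in_iters, out_iters, sizes, num_runs, stream);
  cudaMalloc(&d_temp, temp_bytes);
  cub::DeviceCopy::Batched(d_temp, temp_bytes, in_iters, out_iters, sizes, num_runs, stream);
  cudaFree(d_temp);
}
```

Each batched range then writes run i's value (d_run_offsets[i + 1] - d_run_offsets[i]) times, which matches the result of the scatter followed by the max scan.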

Contributor:
Done. Thanks for the suggestion.

ttnghia and others added 3 commits January 6, 2026 15:44
ttnghia and others added 3 commits January 7, 2026 11:42
shrshi (Contributor) left a comment:
Really nice work, LGTM!

ttnghia requested a review from PointKernel January 7, 2026 20:38
PointKernel (Member) left a comment:
Looks great.

Thank you @shrshi for the thorough review and @ttnghia for helping address the review comments.

ttnghia (Contributor) commented Jan 7, 2026

@revans2 Did you test the latest commit? If so, it is safe to go.

revans2 (Contributor, Author) left a comment:

The changes look good. I reran the benchmark locally and I didn't see any thrust issues or walking off the end of memory. I still want to compare the results with the original code, but I think it is good to go now.

revans2 (Contributor, Author) commented Jan 8, 2026

The change looks great. The performance is basically the same as before, at least on the subset of the benchmarks I could get to run without issues on the old code.

[Benchmark comparison chart: 2_baseline_vs_test_scatter]

PointKernel (Member) commented:

/merge

rapids-bot merged commit b65f02a into rapidsai:main Jan 8, 2026 (148 checks passed)
galipremsagar pushed a commit to galipremsagar/cudf that referenced this pull request Jan 9, 2026
Authors:
  - Robert (Bobby) Evans (https://github.com/revans2)
  - Nghia Truong (https://github.com/ttnghia)
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Shruti Shivakumar (https://github.com/shrshi)
  - Yunsong Wang (https://github.com/PointKernel)
  - Nghia Truong (https://github.com/ttnghia)

URL: rapidsai#20960