Conversation

revans2 (Contributor) commented Dec 31, 2025

Description

Update sort merge join so it can produce more than INT32_MAX output rows.

Hash-based joins can already do this, but sort merge join would get an illegal memory access error (or might end up corrupting data).

I didn't add new tests because the existing benchmarks already cover it:

```
JOIN_NVBENCH --rmm_mode async --benchmark 'high_multiplicity_inner_join' --axis 'Algorithm=SORT_MERGE'
```

When run on a 48 GiB RTX A6000, the benchmark would crash with:

```
Run:  [207/1260] high_multiplicity_inner_join [Device=0 Nullable=0 NullEquality=NULLS_EQUAL DataType=INT32 Algorithm=SORT_MERGE num_keys=2 left_size=10000000 right_size=100000 multiplicity=1000]
/home/roberte/src/spark-rapids-jni/target/jni/cmake-build/_deps/nvbench-src/nvbench/detail/timestamps_kernel.cu:62: Cuda API call returned error: cudaErrorIllegalAddress: an illegal memory access was encountered
Command: 'cudaHostUnregister(&m_host_timestamps)'
```
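For scale: with left_size=10,000,000 and multiplicity=1000, an inner join can produce on the order of 10^10 matched rows, well past INT32_MAX (2,147,483,647), which would explain why 32-bit row indexing overflows in this configuration.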

After this change, all of the benchmarks complete. I spot-checked benchmarks run with and without the fix to see how much overhead the fix adds, and it appears to be < 1% of the runtime.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: Robert (Bobby) Evans <bobby@apache.org>
revans2 requested a review from a team as a code owner December 31, 2025 19:00
revans2 requested reviews from PointKernel and vyasr December 31, 2025 19:00
revans2 self-assigned this Dec 31, 2025
github-actions bot added the libcudf (Affects libcudf (C++/CUDA) code) label Dec 31, 2025
revans2 added the bug, 3 - Ready for Review, non-breaking, and Reliability labels Dec 31, 2025
revans2 (Contributor, Author) commented Dec 31, 2025

I do want to call out that I am not a CUDA expert at all, and most of this was done with the help of AI, so please let me know if I did anything wrong.

revans2 (Contributor, Author) commented Dec 31, 2025

The failed pandas test appears to be totally unrelated to my PR:

```
FAILED tests/groupby/test_groupby.py::test_obj_arg_get_group_deprecated - TypeError: Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construct a NumPy array explicitly.
= 1 failed, 200729 passed, 9140 skipped, 8715 xfailed, 85 xpassed, 18008 warnings in 1907.04s (0:31:47) =
```

Once I get some review feedback, I will upmerge and try the tests again.

revans2 (Contributor, Author) commented Jan 6, 2026

@ttnghia and @PointKernel, could you take another look? I updated the code to avoid issues with 32-bit indexes (or I should say AI did it for me). I ran the performance benchmarks and the performance difference between the two was minimal. The new code appears to use more memory than the old code did, so I hit a few cases where a test failed with an out-of-memory allocation error, which didn't happen with the Thrust 64-bit code. But in either case we are no longer touching memory we shouldn't, so that is a win. I would rather have an exception I can catch and deal with than potential data corruption.

PointKernel requested a review from shrshi January 6, 2026 18:16
```
larger_indices.begin(),
thrust::maximum<size_type>{});

// Scatter in chunks to handle large arrays (> INT32_MAX)
```
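For context, the comment in the diff above marks the chunked-scatter path. A minimal sketch of that pattern, with hypothetical names (scatter_in_chunks, d_map) that are not the PR's actual code, assuming the map's index type is wide enough to address the full output:

```cpp
#include <thrust/scatter.h>
#include <thrust/execution_policy.h>
#include <algorithm>
#include <cstddef>

// Hypothetical helper: perform a large scatter as a sequence of sub-INT32_MAX
// chunks so each individual thrust::scatter call sees a 32-bit-safe count.
template <typename T, typename IndexT>
void scatter_in_chunks(T const* d_values, IndexT const* d_map, std::size_t n, T* d_out)
{
  constexpr std::size_t chunk_size = std::size_t{1} << 30;  // comfortably below INT32_MAX
  for (std::size_t begin = 0; begin < n; begin += chunk_size) {
    std::size_t len = std::min(chunk_size, n - begin);
    // The map entries index into the full output buffer, so d_out is not offset.
    thrust::scatter(thrust::device,
                    d_values + begin,
                    d_values + begin + len,
                    d_map + begin,
                    d_out);
  }
}
```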
Contributor:
Can we replace the scatter + max scan with a single cub::DeviceCopy::Batched? It supports 64-bit operations (NVIDIA/cccl#50 (comment)). In fact, I think we can directly use the run-length-decoding example (https://nvidia.github.io/cccl/cub/api/structcub_1_1DeviceCopy.html#snippet) that CCCL provides.

Contributor:
Sorry, I don't understand the logic here. How can scatter + max scan be replaced by copying? Can you provide pseudocode?

Contributor:
@ttnghia I've implemented the idea here: https://github.com/shrshi/cudf/blob/30374a137c51e9a7e3576a5bbd3b65d2aa9d1fca/cpp/src/join/sort_merge_join.cu#L210

Since cub::DeviceCopy::Batched copies from multiple source ranges, if we provide a constant iterator for each of the source ranges, we achieve the same behavior as a scatter followed by a max scan. That is, if we give the copying API a constant iterator pointing to each of the nonzero match indices, along with the correct output positions in the larger_indices array, we get the desired result.
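To make this concrete, here is a minimal sketch adapted from the CCCL run-length-decoding snippet linked above. The names (run_length_decode, d_run_values, d_run_offsets) are illustrative rather than the PR's actual variables, and the offsets are 64-bit since the total output can exceed INT32_MAX:

```cpp
#include <cub/device/device_copy.cuh>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <cuda_runtime.h>
#include <cstdint>

// For run i, the "source" is a constant iterator that repeats the run's value.
struct value_iter_for_run {
  int32_t const* d_run_values;
  __host__ __device__ auto operator()(uint32_t i) const
  {
    return thrust::make_constant_iterator(d_run_values[i]);
  }
};

// For run i, the destination is the output buffer at that run's starting offset.
struct out_ptr_for_run {
  int32_t* d_out;
  int64_t const* d_run_offsets;
  __host__ __device__ int32_t* operator()(uint32_t i) const { return d_out + d_run_offsets[i]; }
};

// For run i, the copy size is the distance between consecutive offsets.
struct size_for_run {
  int64_t const* d_run_offsets;
  __host__ __device__ int64_t operator()(uint32_t i) const
  {
    return d_run_offsets[i + 1] - d_run_offsets[i];
  }
};

void run_length_decode(int32_t const* d_run_values,
                       int64_t const* d_run_offsets,  // num_runs + 1 entries
                       uint32_t num_runs,
                       int32_t* d_out,
                       cudaStream_t stream)
{
  thrust::counting_iterator<uint32_t> iota{0};
  auto in_iters  = thrust::make_transform_iterator(iota, value_iter_for_run{d_run_values});
  auto out_iters = thrust::make_transform_iterator(iota, out_ptr_for_run{d_out, d_run_offsets});
  auto sizes     = thrust::make_transform_iterator(iota, size_for_run{d_run_offsets});

  // Standard CUB two-phase pattern: first query temp storage, then run.
  void* d_temp      = nullptr;
  size_t temp_bytes = 0;
  cub::DeviceCopy::Batched(d_temp, temp_bytes, in_iters, out_iters, sizes, num_runs, stream);
  cudaMalloc(&d_temp, temp_bytes);
  cub::DeviceCopy::Batched(d_temp, temp_bytes, in_iters, out_iters, sizes, num_runs, stream);
  cudaFree(d_temp);
}
```

Each batched range then writes run i's value (d_run_offsets[i + 1] - d_run_offsets[i]) times, which matches the result of the scatter followed by the max scan.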

Contributor:
Done. Thanks for the suggestion.

ttnghia and others added 3 commits January 6, 2026 15:44
ttnghia and others added 3 commits January 7, 2026 11:42
shrshi (Contributor) left a comment:
Really nice work, LGTM!

ttnghia requested a review from PointKernel January 7, 2026 20:38
PointKernel (Member) left a comment:
Looks great.

Thank you @shrshi for the thorough review and @ttnghia for helping address the review comments.

ttnghia (Contributor) commented Jan 7, 2026

@revans2 Did you test the latest commit? If so, it is safe to go.

revans2 (Contributor, Author) left a comment:

The changes look good. I reran the benchmark locally and I didn't see any thrust issues or walking off the end of memory. I still want to compare the results with the original code, but I think it is good to go now.

revans2 (Contributor, Author) commented Jan 8, 2026

The change looks great. The performance is basically the same as before, at least on the subset of the benchmarks I could get to run without issues on the old code.

[Benchmark comparison chart: 2_baseline_vs_test_scatter]

PointKernel (Member) commented:

/merge

rapids-bot merged commit b65f02a into rapidsai:main Jan 8, 2026 (148 checks passed)
galipremsagar pushed a commit to galipremsagar/cudf that referenced this pull request Jan 9, 2026
Authors:
  - Robert (Bobby) Evans (https://github.com/revans2)
  - Nghia Truong (https://github.com/ttnghia)
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Shruti Shivakumar (https://github.com/shrshi)
  - Yunsong Wang (https://github.com/PointKernel)
  - Nghia Truong (https://github.com/ttnghia)

URL: rapidsai#20960