Allow sort merge join to go above int32 output row limits #20960
Conversation
Signed-off-by: Robert (Bobby) Evans <bobby@apache.org>
I do want to call out that I am not a CUDA expert at all, and most of this was done with the help of AI. So please let me know if I did anything wrong at all.

The failed pandas test appears to be totally unrelated to my PR. Once I get some review feedback I will upmerge and try the tests again.
Signed-off-by: Robert (Bobby) Evans <bobby@apache.org>
@ttnghia and @PointKernel could you take another look? I updated the code to avoid issues with 32-bit indexes (or I should say AI did it for me). I ran the performance benchmarks and the performance difference between the two was minimal. The new code appears to use more memory than the old code did, so I got a few cases where the test failed with an out-of-memory allocation, which didn't happen with the thrust 64-bit code. But in either case we are not touching memory we should not be, so that is a win. I would rather have an exception I can catch and deal with than potential data corruption.
cpp/src/join/sort_merge_join.cu
Outdated
```
  larger_indices.begin(),
  thrust::maximum<size_type>{});

// Scatter in chunks to handle large arrays (> INT32_MAX)
```
Can we replace the scatter + max scan with a single cub::DeviceCopy::Batched? It supports 64-bit operations (NVIDIA/cccl#50 (comment)). In fact, I think we can directly use the run-length decoding example (https://nvidia.github.io/cccl/cub/api/structcub_1_1DeviceCopy.html#snippet) that CCCL provides.
Sorry, I don't understand the logic here. How can scatter + max scan be replaced by copying? Can you provide pseudocode?
@ttnghia I've implemented the idea here: https://github.com/shrshi/cudf/blob/30374a137c51e9a7e3576a5bbd3b65d2aa9d1fca/cpp/src/join/sort_merge_join.cu#L210
Since cub::DeviceCopy::Batched copies from multiple source ranges, if we provide a constant iterator for each source range, we achieve the same behavior as a scatter followed by a max scan. That is, if we provide to the copy API a constant iterator pointing to each of the nonzero match indices, along with the corresponding output positions in the larger_indices array, we get the desired result.
Done. Thanks for the suggestion.
# Conflicts:
#	cpp/src/join/sort_merge_join.cu
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
…ssues Signed-off-by: Robert (Bobby) Evans <bobby@apache.org>
shrshi
left a comment
Really nice work, LGTM!
PointKernel
left a comment
@revans2 Did you test the latest commit? If so, it is safe to go.
revans2
left a comment
The changes look good. I reran the benchmark locally and I didn't see any thrust issues or walking off the end of memory. I still want to compare the results with the original code, but I think it is good to go now.
/merge
…0960)

Update sort merge join so it can output more than int32 output rows. Hash-based joins can already do this, but sort merge join would get an illegal memory access error (or might end up corrupting data). I didn't add new tests because the existing benchmarks cover it.

```
JOIN_NVBENCH --rmm_mode async --benchmark 'high_multiplicity_inner_join' --axis 'Algorithm=SORT_MERGE'
```

When run on a 48 GiB RTX A6000 the benchmark would crash with

```
Run: [207/1260] high_multiplicity_inner_join [Device=0 Nullable=0 NullEquality=NULLS_EQUAL DataType=INT32 Algorithm=SORT_MERGE num_keys=2 left_size=10000000 right_size=100000 multiplicity=1000]
/home/roberte/src/spark-rapids-jni/target/jni/cmake-build/_deps/nvbench-src/nvbench/detail/timestamps_kernel.cu:62: Cuda API call returned error: cudaErrorIllegalAddress: an illegal memory access was encountered
Command: 'cudaHostUnregister(&m_host_timestamps)'
```

But after this change all of the benchmarks complete. I did some spot checks on benchmarks that ran both with and without the fix to see if there was much overhead associated with it, and it appears to be < 1% of the runtime.

Authors:
- Robert (Bobby) Evans (https://github.com/revans2)
- Nghia Truong (https://github.com/ttnghia)
- Yunsong Wang (https://github.com/PointKernel)

Approvers:
- Shruti Shivakumar (https://github.com/shrshi)
- Yunsong Wang (https://github.com/PointKernel)
- Nghia Truong (https://github.com/ttnghia)

URL: rapidsai#20960
