@Weijun-H
Member

@Weijun-H Weijun-H commented Oct 27, 2025

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Existing tests cover this change

Are there any user-facing changes?

No

@Weijun-H Weijun-H changed the title from "refactor: remove dependency on arrow_ord" to "refactor: remove arrow-ord dependency in arrow-cast" Oct 27, 2025
@vegarsti
Contributor

Could you try running the benchmark from PR #8710 and see what the difference is? I had assumed cast_array.slice would clone the data, but it doesn't, so this might be quite fast.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Oct 27, 2025
@Weijun-H Weijun-H force-pushed the 8708-remove-arrow-ord-in-arrow-cast branch from a6198b8 to c72af36 Compare October 27, 2025 17:02
@alamb
Contributor

alamb commented Oct 27, 2025

I just reviewed the benchmark in

and I think it looks good to go. I'll merge it in and then run the benchmarks on this PR

Contributor

@alamb alamb left a comment


Thanks @Weijun-H and @vegarsti

values_indexes.push(0);
let mut current_data = array.slice(0, 1).to_data();
for idx in 1..array.len() {
    let next_data = array.slice(idx, 1).to_data();
Contributor

I think this is likely to be substantially slower than what partition does, but we can see what the benchmarks show.
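For reference, the cost concern is about building a one-element slice and its ArrayData for every index. A minimal sketch of the alternative, assuming the values are already available as a typed slice, finds run boundaries in a single pass by comparing neighbours. The helper name `run_ends_and_value_indexes` is hypothetical and is not code from this PR or from arrow-rs.

```rust
/// Hypothetical helper (not the PR's code): computes run-end-encoded
/// metadata for a typed slice in one pass by comparing neighbours.
/// Returns (run_ends, values_indexes): run_ends[i] is the exclusive end
/// of run i, and values_indexes[i] is the index of its first element.
fn run_ends_and_value_indexes<T: PartialEq>(values: &[T]) -> (Vec<usize>, Vec<usize>) {
    let mut run_ends = Vec::new();
    let mut values_indexes = Vec::new();
    if values.is_empty() {
        return (run_ends, values_indexes);
    }
    // The first run always starts at index 0.
    values_indexes.push(0);
    for idx in 1..values.len() {
        if values[idx] != values[idx - 1] {
            // A new run starts at `idx`, so the previous run ends here.
            run_ends.push(idx);
            values_indexes.push(idx);
        }
    }
    // Close the final run.
    run_ends.push(values.len());
    (run_ends, values_indexes)
}

fn main() {
    let (run_ends, values_indexes) = run_ends_and_value_indexes(&[7, 7, 7, 2, 2, 9]);
    // prints: [3, 5, 6] [0, 3, 5]
    println!("{run_ends:?} {values_indexes:?}");
}
```

This only illustrates the shape of the work; it says nothing about how partition is implemented internally.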


// Partition the array to identify runs of consecutive equal values
let partitions = partition(&[Arc::clone(cast_array)])?;
let mut run_ends = Vec::new();
Contributor

I looked briefly at a profile for this function -- I think we could make it substantially faster by reducing allocations with a pre-sized vector here (use partitions.count_ones() to know how many partitions are needed).
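A rough sketch of the pre-sizing idea, under stated assumptions: the boundary mask is modelled here as a plain `&[bool]` (a stand-in, not the arrow-ord `Partitions` type), and `run_ends_from_boundaries` is a hypothetical name. Counting the set entries first lets the output vector be allocated once instead of growing as runs are pushed.

```rust
/// Hypothetical sketch: `boundaries[i]` is true when element `i + 1`
/// differs from element `i`, so `boundaries.len()` is `len - 1`.
/// This is a stand-in for a partition boundary mask, not arrow-ord's type.
fn run_ends_from_boundaries(boundaries: &[bool], len: usize) -> Vec<usize> {
    if len == 0 {
        return Vec::new();
    }
    // One run per set entry, plus the final run that ends at `len`.
    let num_runs = boundaries.iter().filter(|&&b| b).count() + 1;
    // Allocate once up front so the pushes below never reallocate.
    let mut run_ends = Vec::with_capacity(num_runs);
    for (i, &is_boundary) in boundaries.iter().enumerate() {
        if is_boundary {
            run_ends.push(i + 1);
        }
    }
    run_ends.push(len);
    debug_assert_eq!(run_ends.len(), num_runs);
    run_ends
}

fn main() {
    // Mask for the values [7, 7, 7, 2, 2, 9].
    let boundaries = [false, false, true, false, true];
    let run_ends = run_ends_from_boundaries(&boundaries, 6);
    // prints: [3, 5, 6]
    println!("{run_ends:?}");
}
```

The extra counting pass costs one scan over the mask, which is usually cheaper than repeated reallocation when there are many runs.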

[Screenshot 2025-10-27 at 3:05:53 PM]

Contributor

Oh, great idea!

Side note: how did you profile this? It looks like samply, with cargo build --profile profiling, running e.g. a unit test?

Contributor

I used Instruments, which is part of Xcode on macOS -- it is pretty sweet, as it does whole-system profiling (fire it up, start recording, and it gathers the info for all processes).

Contributor

Pushed your suggestion as a PR here: #8716, maybe you can run the benchmark on that too? 😇

@alamb
Contributor

alamb commented Oct 27, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1017-gcp #18~24.04.1-Ubuntu SMP Tue Sep 23 17:51:44 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing 8708-remove-arrow-ord-in-arrow-cast (70b24d1) to 62df32e diff
BENCH_NAME=cast_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench cast_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=8708-remove-arrow-ord-in-arrow-cast
Results will be posted here when complete

@alamb
Contributor

alamb commented Oct 27, 2025

🤖: Benchmark completed

Details

group                                                              8708-remove-arrow-ord-in-arrow-cast     main
-----                                                              -----------------------------------     ----
cast binary view to string                                         1.00     68.6±0.30µs        ? ?/sec     1.07     73.4±0.31µs        ? ?/sec
cast binary view to string view                                    1.23    115.8±0.39µs        ? ?/sec     1.00     93.9±0.32µs        ? ?/sec
cast binary view to wide string                                    1.15     74.4±0.28µs        ? ?/sec     1.00     64.8±0.34µs        ? ?/sec
cast date32 to date64 512                                          1.00    293.1±0.84ns        ? ?/sec     1.03    301.6±1.53ns        ? ?/sec
cast date64 to date32 512                                          1.00    501.4±1.12ns        ? ?/sec     1.01    505.7±1.89ns        ? ?/sec
cast decimal128 to decimal128 512                                  1.01    610.2±0.91ns        ? ?/sec     1.00    604.3±1.45ns        ? ?/sec
cast decimal128 to decimal128 512 lower precision                  1.00      5.1±0.02µs        ? ?/sec     1.01      5.1±0.03µs        ? ?/sec
cast decimal128 to decimal128 512 with lower scale (infallible)    1.01      6.9±0.03µs        ? ?/sec     1.00      6.8±0.02µs        ? ?/sec
cast decimal128 to decimal128 512 with same scale                  1.07     81.1±0.12ns        ? ?/sec     1.00     75.8±0.16ns        ? ?/sec
cast decimal128 to decimal256 512                                  1.00      2.3±0.01µs        ? ?/sec     1.01      2.3±0.02µs        ? ?/sec
cast decimal256 to decimal128 512                                  1.00     48.6±0.14µs        ? ?/sec     1.00     48.5±0.34µs        ? ?/sec
cast decimal256 to decimal256 512                                  1.00     11.1±0.03µs        ? ?/sec     1.02     11.3±0.03µs        ? ?/sec
cast decimal256 to decimal256 512 with same scale                  1.09     82.2±0.20ns        ? ?/sec     1.00     75.7±0.20ns        ? ?/sec
cast decimal32 to decimal32 512                                    1.00      2.3±0.01µs        ? ?/sec     1.14      2.6±0.01µs        ? ?/sec
cast decimal32 to decimal32 512 lower precision                    1.00      2.8±0.03µs        ? ?/sec     1.02      2.8±0.01µs        ? ?/sec
cast decimal32 to decimal64 512                                    1.10    347.3±3.82ns        ? ?/sec     1.00    316.7±0.93ns        ? ?/sec
cast decimal64 to decimal32 512                                    1.01      3.0±0.01µs        ? ?/sec     1.00      3.0±0.01µs        ? ?/sec
cast decimal64 to decimal64 512                                    1.01    381.5±2.23ns        ? ?/sec     1.00    376.8±0.63ns        ? ?/sec
cast dict to string view                                           1.00     52.3±0.21µs        ? ?/sec     1.03     53.8±0.12µs        ? ?/sec
cast f32 to string 512                                             1.06     19.1±0.68µs        ? ?/sec     1.00     18.1±0.04µs        ? ?/sec
cast f64 to string 512                                             1.00     21.8±0.05µs        ? ?/sec     1.04     22.6±0.07µs        ? ?/sec
cast float32 to int32 512                                          1.00   1564.8±3.69ns        ? ?/sec     1.00   1560.3±4.08ns        ? ?/sec
cast float64 to float32 512                                        1.01   1088.4±3.42ns        ? ?/sec     1.00   1077.6±5.59ns        ? ?/sec
cast float64 to uint64 512                                         1.01   1769.7±5.39ns        ? ?/sec     1.00   1754.0±2.46ns        ? ?/sec
cast i64 to string 512                                             1.02     14.7±0.12µs        ? ?/sec     1.00     14.4±0.04µs        ? ?/sec
cast int32 to float32 512                                          1.02   1065.8±2.83ns        ? ?/sec     1.00   1047.8±4.28ns        ? ?/sec
cast int32 to float64 512                                          1.01   1071.1±4.97ns        ? ?/sec     1.00   1056.6±2.00ns        ? ?/sec
cast int32 to int32 512                                            1.01    201.1±1.01ns        ? ?/sec     1.00    198.7±0.45ns        ? ?/sec
cast int32 to int64 512                                            1.00   1084.5±1.52ns        ? ?/sec     1.08   1167.5±4.21ns        ? ?/sec
cast int32 to uint32 512                                           1.03   1517.6±5.18ns        ? ?/sec     1.00   1466.4±3.60ns        ? ?/sec
cast int64 to int32 512                                            1.00   1562.5±2.39ns        ? ?/sec     1.08   1684.9±3.44ns        ? ?/sec
cast no runs of int32s to ree<int32>                               18.65  1452.8±3.65µs        ? ?/sec     1.00     77.9±0.40µs        ? ?/sec
cast runs of 10 string to ree<int32>                               83.20  1357.7±3.84µs        ? ?/sec     1.00     16.3±0.08µs        ? ?/sec
cast runs of 1000 int32s to ree<int32>                             164.19  1339.8±2.22µs        ? ?/sec    1.00      8.2±0.04µs        ? ?/sec
cast string single run to ree<int32>                               57.34  1568.5±2.43µs        ? ?/sec     1.00     27.4±0.08µs        ? ?/sec
cast string to binary view 512                                     1.00      3.2±0.01µs        ? ?/sec     1.02      3.3±0.01µs        ? ?/sec
cast string view to binary view                                    1.00     97.8±0.20ns        ? ?/sec     1.00     97.3±0.20ns        ? ?/sec
cast string view to dict                                           1.00    173.5±0.35µs        ? ?/sec     1.04    180.1±0.30µs        ? ?/sec
cast string view to string                                         1.00     48.2±0.11µs        ? ?/sec     1.02     49.1±0.52µs        ? ?/sec
cast string view to wide string                                    1.00     48.4±0.16µs        ? ?/sec     1.07     51.8±0.22µs        ? ?/sec
cast time32s to time32ms 512                                       1.01    288.3±1.04ns        ? ?/sec     1.00    285.8±0.40ns        ? ?/sec
cast time32s to time64us 512                                       1.01    292.1±0.30ns        ? ?/sec     1.00    290.6±0.86ns        ? ?/sec
cast time64ns to time32s 512                                       1.00    503.3±4.33ns        ? ?/sec     1.01    507.9±0.76ns        ? ?/sec
cast timestamp_ms to i64 512                                       1.05    453.0±2.06ns        ? ?/sec     1.00    433.1±1.14ns        ? ?/sec
cast timestamp_ms to timestamp_ns 512                              1.06      2.3±0.02µs        ? ?/sec     1.00      2.2±0.00µs        ? ?/sec
cast timestamp_ns to timestamp_s 512                               1.01    200.6±1.05ns        ? ?/sec     1.00    197.7±0.35ns        ? ?/sec
cast utf8 to date32 512                                            1.00     11.4±0.02µs        ? ?/sec     1.00     11.4±0.04µs        ? ?/sec
cast utf8 to date64 512                                            1.08     46.3±0.08µs        ? ?/sec     1.00     42.8±0.16µs        ? ?/sec
cast utf8 to f32                                                   1.00     11.5±0.09µs        ? ?/sec     1.01     11.7±0.03µs        ? ?/sec
cast wide string to binary view 512                                1.00      5.6±0.01µs        ? ?/sec     1.00      5.6±0.01µs        ? ?/sec

@alamb
Contributor

alamb commented Oct 27, 2025

cast no runs of int32s to ree<int32>      18.65   1452.8±3.65µs   ? ?/sec    1.00   77.9±0.40µs   ? ?/sec
cast runs of 10 string to ree<int32>      83.20   1357.7±3.84µs   ? ?/sec    1.00   16.3±0.08µs   ? ?/sec
cast runs of 1000 int32s to ree<int32>    164.19  1339.8±2.22µs   ? ?/sec    1.00    8.2±0.04µs   ? ?/sec
cast string single run to ree<int32>      57.34   1568.5±2.43µs   ? ?/sec    1.00   27.4±0.08µs   ? ?/sec

As @vegarsti predicted, this PR appears to be quite a bit slower than using partition: roughly 19x to 164x slower on the four ree<int32> benchmarks.

@Weijun-H Weijun-H force-pushed the 8708-remove-arrow-ord-in-arrow-cast branch from 70b24d1 to fe208be Compare October 27, 2025 23:00
@Weijun-H
Member Author

Weijun-H commented Oct 28, 2025

FYI @vegarsti, @alamb: after several rounds of optimization, the current version is significantly faster than the previous one.

  • Type-specialized dispatch:
    compute_run_boundaries now routes each physical layout (boolean, primitive scalars, binary/string, etc.) to a dedicated helper, allowing most arrays to bypass the slow, generic ArrayData comparison path.
  • Chunked primitive scanning:
    The no-null primitive path uses scan_run_end, which compares 16 bytes at a time via u128 loads. When a chunk differs, it falls back to scalar iteration, which reduces branches and bounds checks in the hot loop.
  • Targeted use of unsafe for performance:
    Tight loops leverage get_unchecked, from_raw_parts, and read_unaligned to eliminate redundant bounds and alignment checks. Each unsafe block includes detailed safety comments describing the invariants upheld.
  • Generic fallback:
    Less common types still rely on ArrayData equality but reuse the shared accumulator to produce consistent run and value outputs, with no special-cased memory management.
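
To make the chunked scanning idea concrete, here is a safe, simplified sketch for i32 values with no nulls. It is an illustration only and not the PR's scan_run_end: the names `scan_run_end_sketch` and `pack` are hypothetical, and it builds each u128 from bytes with `from_ne_bytes` instead of using the unsafe unaligned loads described above.

```rust
/// Hypothetical, safe sketch (not the PR's `scan_run_end`): returns the
/// exclusive end of the run that starts at `start`.
/// Panics if `start >= values.len()`.
fn scan_run_end_sketch(values: &[i32], start: usize) -> usize {
    const LANES: usize = 4; // four i32 values = 16 bytes
    let first = values[start];
    // A 16-byte pattern holding `first` in all four lanes.
    let pattern = pack(&[first; LANES]);
    let mut idx = start + 1;
    // Fast path: test four values at a time with one u128 comparison.
    while idx + LANES <= values.len() {
        if pack(&values[idx..idx + LANES]) != pattern {
            break; // the run ends somewhere inside this chunk
        }
        idx += LANES;
    }
    // Scalar path: pin down the exact boundary and handle the tail.
    while idx < values.len() && values[idx] == first {
        idx += 1;
    }
    idx
}

/// Packs exactly four i32 lanes into one u128 using native byte order.
fn pack(lanes: &[i32]) -> u128 {
    assert_eq!(lanes.len(), 4);
    let mut bytes = [0u8; 16];
    for (dst, lane) in bytes.chunks_exact_mut(4).zip(lanes) {
        dst.copy_from_slice(&lane.to_ne_bytes());
    }
    u128::from_ne_bytes(bytes)
}

fn main() {
    let values = [5, 5, 5, 5, 5, 5, 5, 8, 8, 1];
    let mut run_ends = Vec::new();
    let mut start = 0;
    while start < values.len() {
        let end = scan_run_end_sketch(&values, start);
        run_ends.push(end);
        start = end;
    }
    // prints: [7, 9, 10]
    println!("{run_ends:?}");
}
```

The safe version gives up some of the speed that the unsafe loads are meant to buy, but it shows the control flow: a wide comparison for long runs, then scalar iteration once a chunk differs.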
cast string single run to ree<int32>
                        time:   [23.143 µs 23.180 µs 23.224 µs]
                        change: [−8.5926% −6.6138% −5.2622%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  9 (9.00%) high severe

cast runs of 10 string to ree<int32>
                        time:   [4.4857 µs 4.4924 µs 4.4999 µs]
                        change: [−35.582% −32.807% −30.598%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

cast runs of 1000 int32s to ree<int32>
                        time:   [1.9651 µs 1.9923 µs 2.0449 µs]
                        change: [−35.958% −34.582% −33.095%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

cast no runs of int32s to ree<int32>
                        time:   [27.745 µs 28.013 µs 28.291 µs]
                        change: [−27.957% −27.305% −26.645%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  14 (14.00%) high mild

@alamb
Contributor

alamb commented Oct 28, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1017-gcp #18~24.04.1-Ubuntu SMP Tue Sep 23 17:51:44 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing 8708-remove-arrow-ord-in-arrow-cast (f9fc4fe) to 6c3e588 diff
BENCH_NAME=cast_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench cast_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=8708-remove-arrow-ord-in-arrow-cast
Results will be posted here when complete

@alamb
Contributor

alamb commented Oct 28, 2025

🤖: Benchmark completed

Details

group                                                              8708-remove-arrow-ord-in-arrow-cast    main
-----                                                              -----------------------------------    ----
cast binary view to string                                         1.00     68.8±0.29µs        ? ?/sec    1.07     73.3±0.29µs        ? ?/sec
cast binary view to string view                                    1.24    115.9±0.41µs        ? ?/sec    1.00     93.4±0.16µs        ? ?/sec
cast binary view to wide string                                    1.14     73.8±0.32µs        ? ?/sec    1.00     64.9±0.27µs        ? ?/sec
cast date32 to date64 512                                          1.00    295.8±1.07ns        ? ?/sec    1.00    296.2±0.51ns        ? ?/sec
cast date64 to date32 512                                          1.03    512.0±4.08ns        ? ?/sec    1.00    499.2±0.86ns        ? ?/sec
cast decimal128 to decimal128 512                                  1.01    609.4±2.59ns        ? ?/sec    1.00    605.9±3.07ns        ? ?/sec
cast decimal128 to decimal128 512 lower precision                  1.02      5.2±0.03µs        ? ?/sec    1.00      5.1±0.01µs        ? ?/sec
cast decimal128 to decimal128 512 with lower scale (infallible)    1.00      6.8±0.09µs        ? ?/sec    1.00      6.8±0.01µs        ? ?/sec
cast decimal128 to decimal128 512 with same scale                  1.00     75.6±0.15ns        ? ?/sec    1.00     76.0±0.08ns        ? ?/sec
cast decimal128 to decimal256 512                                  1.01      2.3±0.00µs        ? ?/sec    1.00      2.3±0.01µs        ? ?/sec
cast decimal256 to decimal128 512                                  1.00     48.5±0.18µs        ? ?/sec    1.00     48.3±0.06µs        ? ?/sec
cast decimal256 to decimal256 512                                  1.00     11.5±0.08µs        ? ?/sec    1.05     12.1±0.08µs        ? ?/sec
cast decimal256 to decimal256 512 with same scale                  1.00     75.7±0.11ns        ? ?/sec    1.00     75.5±0.13ns        ? ?/sec
cast decimal32 to decimal32 512                                    1.00      2.3±0.03µs        ? ?/sec    1.13      2.6±0.02µs        ? ?/sec
cast decimal32 to decimal32 512 lower precision                    1.00      2.8±0.03µs        ? ?/sec    1.01      2.8±0.02µs        ? ?/sec
cast decimal32 to decimal64 512                                    1.10    347.8±2.21ns        ? ?/sec    1.00    316.6±0.56ns        ? ?/sec
cast decimal64 to decimal32 512                                    1.01      3.0±0.02µs        ? ?/sec    1.00      3.0±0.00µs        ? ?/sec
cast decimal64 to decimal64 512                                    1.01    381.6±5.70ns        ? ?/sec    1.00    376.0±0.49ns        ? ?/sec
cast dict to string view                                           1.00     52.5±0.10µs        ? ?/sec    1.02     53.8±0.09µs        ? ?/sec
cast f32 to string 512                                             1.03     18.7±0.04µs        ? ?/sec    1.00     18.3±0.04µs        ? ?/sec
cast f64 to string 512                                             1.00     21.4±0.12µs        ? ?/sec    1.06     22.6±0.12µs        ? ?/sec
cast float32 to int32 512                                          1.01   1577.3±2.44ns        ? ?/sec    1.00   1567.4±1.85ns        ? ?/sec
cast float64 to float32 512                                        1.02   1110.7±3.13ns        ? ?/sec    1.00   1091.8±1.88ns        ? ?/sec
cast float64 to uint64 512                                         1.02   1773.8±1.71ns        ? ?/sec    1.00   1742.5±3.28ns        ? ?/sec
cast i64 to string 512                                             1.00     14.4±0.04µs        ? ?/sec    1.02     14.7±0.13µs        ? ?/sec
cast int32 to float32 512                                          1.00   1015.5±1.26ns        ? ?/sec    1.04   1054.3±2.03ns        ? ?/sec
cast int32 to float64 512                                          1.03   1088.9±3.95ns        ? ?/sec    1.00   1053.8±1.81ns        ? ?/sec
cast int32 to int32 512                                            1.12    223.6±0.53ns        ? ?/sec    1.00    198.8±0.20ns        ? ?/sec
cast int32 to int64 512                                            1.00   1096.9±0.99ns        ? ?/sec    1.06   1167.2±2.57ns        ? ?/sec
cast int32 to uint32 512                                           1.05   1531.4±4.62ns        ? ?/sec    1.00   1464.9±1.52ns        ? ?/sec
cast int64 to int32 512                                            1.00  1568.7±34.86ns        ? ?/sec    1.08   1688.6±2.08ns        ? ?/sec
cast no runs of int32s to ree<int32>                               1.00     56.3±0.10µs        ? ?/sec    1.36     76.7±0.18µs        ? ?/sec
cast runs of 10 string to ree<int32>                               1.00      9.3±0.02µs        ? ?/sec    1.72     16.0±0.07µs        ? ?/sec
cast runs of 1000 int32s to ree<int32>                             1.00      3.9±0.01µs        ? ?/sec    2.13      8.2±0.02µs        ? ?/sec
cast string single run to ree<int32>                               1.00     23.8±0.08µs        ? ?/sec    1.15     27.4±0.08µs        ? ?/sec
cast string to binary view 512                                     1.00      3.3±0.00µs        ? ?/sec    1.00      3.3±0.00µs        ? ?/sec
cast string view to binary view                                    1.00     96.4±0.12ns        ? ?/sec    1.02     98.1±0.19ns        ? ?/sec
cast string view to dict                                           1.02    175.6±0.39µs        ? ?/sec    1.00    171.5±0.36µs        ? ?/sec
cast string view to string                                         1.00     48.4±0.10µs        ? ?/sec    1.01     48.9±0.08µs        ? ?/sec
cast string view to wide string                                    1.00     49.8±0.27µs        ? ?/sec    1.04     51.7±0.15µs        ? ?/sec
cast time32s to time32ms 512                                       1.02    290.8±0.91ns        ? ?/sec    1.00    285.2±0.44ns        ? ?/sec
cast time32s to time64us 512                                       1.04    302.1±0.60ns        ? ?/sec    1.00    289.8±0.55ns        ? ?/sec
cast time64ns to time32s 512                                       1.01    508.7±1.38ns        ? ?/sec    1.00    501.2±5.97ns        ? ?/sec
cast timestamp_ms to i64 512                                       1.00    434.9±1.91ns        ? ?/sec    1.01    440.4±5.99ns        ? ?/sec
cast timestamp_ms to timestamp_ns 512                              1.06      2.3±0.00µs        ? ?/sec    1.00      2.2±0.01µs        ? ?/sec
cast timestamp_ns to timestamp_s 512                               1.13    224.0±0.58ns        ? ?/sec    1.00    197.9±0.57ns        ? ?/sec
cast utf8 to date32 512                                            1.00     11.4±0.12µs        ? ?/sec    1.00     11.3±0.02µs        ? ?/sec
cast utf8 to date64 512                                            1.00     42.6±0.44µs        ? ?/sec    1.00     42.8±0.26µs        ? ?/sec
cast utf8 to f32                                                   1.01     11.8±0.03µs        ? ?/sec    1.00     11.7±0.06µs        ? ?/sec
cast wide string to binary view 512                                1.02      5.7±0.01µs        ? ?/sec    1.00      5.6±0.01µs        ? ?/sec



Development

Successfully merging this pull request may close these issues.

Remove arrow-ord dependency in arrow-cast due to RunEndEncoded casting
