refactor: remove `arrow-ord` dependency in `arrow-cast` #8716

Weijun-H · 2025-10-27T09:16:11Z

Which issue does this PR close?

Closes Remove arrow-ord dependency in arrow-cast due to RunEndEncoded casting #8708

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Existing tests cover this change

Are there any user-facing changes?

No

vegarsti · 2025-10-27T11:40:42Z

Could you try running the benchmark in this PR #8710 and see what the difference is? I thought cast_array.slice would be doing a clone, but it's not, so this might be quite fast.

alamb · 2025-10-27T19:08:32Z

I just reviewed the benchmark in

Add benchmark for casting to RunEndEncoded (REE) #8710

and I think it looks good to go. I'll merge it in and then run the benchmarks on this PR

alamb

Thanks @Weijun-H and @vegarsti

alamb · 2025-10-27T19:10:06Z

arrow-cast/src/cast/run_array.rs

+    values_indexes.push(0);
+    let mut current_data = array.slice(0, 1).to_data();
+    for idx in 1..array.len() {
+        let next_data = array.slice(idx, 1).to_data();


I think this is likely to be substantially slower than what partition does, but we can see what the benchmarks show
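
For reference, a rough sketch of the boundary detection that the diff above performs, written over a plain slice rather than Arrow arrays. The function name `run_boundaries` and the `&[T]` input are illustrative assumptions, not the PR's code. Comparing neighbouring elements in place avoids constructing an `ArrayData` per index, which is the cost this comment is concerned about:

```rust
/// Hypothetical helper (not from the PR): returns `(run_ends, values_indexes)`
/// for a plain slice. `run_ends[i]` is the exclusive end of run `i`, and
/// `values_indexes[i]` is the index of the first element of run `i`.
fn run_boundaries<T: PartialEq>(values: &[T]) -> (Vec<usize>, Vec<usize>) {
    let mut run_ends = Vec::new();
    let mut values_indexes = Vec::new();
    if values.is_empty() {
        return (run_ends, values_indexes);
    }
    // The first run always starts at index 0.
    values_indexes.push(0);
    for idx in 1..values.len() {
        // A new run starts whenever an element differs from its predecessor.
        if values[idx] != values[idx - 1] {
            run_ends.push(idx);
            values_indexes.push(idx);
        }
    }
    // Close the final run at the end of the input.
    run_ends.push(values.len());
    (run_ends, values_indexes)
}
```

For `["a", "a", "b", "c", "c", "c"]` this yields run ends `[2, 3, 6]` and value indexes `[0, 2, 3]`.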

alamb · 2025-10-27T19:11:23Z

arrow-cast/src/cast/run_array.rs


-    // Partition the array to identify runs of consecutive equal values
-    let partitions = partition(&[Arc::clone(cast_array)])?;
-    let mut run_ends = Vec::new();


I looked briefly at a profile for this function -- I think we could make it substantially faster by reducing allocations with a pre-sized vector here (use partitions.count_ones() to know how many partitions are needed)

Oh, great idea!

Side note: how did you profile this? It looks like samply, with cargo build --profile profiling, running e.g. a unit test?

I used Instruments, which is part of Xcode on macOS -- it is pretty sweet, as it does whole-system profiling (fire it up, start recording, and it gathers the info for all processes)

Pushed your suggestion as a PR here: #8726, maybe you can run the benchmark on that too? 😇
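
The pre-sizing idea from this thread can be sketched without Arrow types. Here a `&[bool]` stands in for the partition boundary bitmap (an assumption for illustration; the real code would count set bits on the bitmap, as suggested above). Counting the set bits first gives the exact number of runs, so the vector is allocated once:

```rust
/// Hypothetical sketch (not the PR's code). `boundaries[i]` is true when
/// element `i + 1` differs from element `i`, so it has `len - 1` entries.
fn run_ends_presized(boundaries: &[bool], len: usize) -> Vec<i32> {
    if len == 0 {
        return Vec::new();
    }
    // One run per boundary, plus the final run that ends at `len`.
    let num_runs = boundaries.iter().filter(|b| **b).count() + 1;
    let mut run_ends = Vec::with_capacity(num_runs);
    for (i, &is_boundary) in boundaries.iter().enumerate() {
        if is_boundary {
            // The run ends (exclusively) at the first differing element.
            run_ends.push((i + 1) as i32);
        }
    }
    run_ends.push(len as i32);
    run_ends
}
```

For the values `[1, 1, 2, 2, 2, 3]` the boundary mask is `[false, true, false, false, true]` and the result is `[2, 5, 6]`, with no reallocation while pushing.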

alamb · 2025-10-27T19:24:59Z

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1017-gcp #18~24.04.1-Ubuntu SMP Tue Sep 23 17:51:44 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing 8708-remove-arrow-ord-in-arrow-cast (70b24d1) to 62df32e diff
BENCH_NAME=cast_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench cast_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=8708-remove-arrow-ord-in-arrow-cast
Results will be posted here when complete

alamb · 2025-10-27T19:43:08Z

🤖: Benchmark completed

Details

group                                                              8708-remove-arrow-ord-in-arrow-cast     main
-----                                                              -----------------------------------     ----
cast binary view to string                                         1.00     68.6±0.30µs        ? ?/sec     1.07     73.4±0.31µs        ? ?/sec
cast binary view to string view                                    1.23    115.8±0.39µs        ? ?/sec     1.00     93.9±0.32µs        ? ?/sec
cast binary view to wide string                                    1.15     74.4±0.28µs        ? ?/sec     1.00     64.8±0.34µs        ? ?/sec
cast date32 to date64 512                                          1.00    293.1±0.84ns        ? ?/sec     1.03    301.6±1.53ns        ? ?/sec
cast date64 to date32 512                                          1.00    501.4±1.12ns        ? ?/sec     1.01    505.7±1.89ns        ? ?/sec
cast decimal128 to decimal128 512                                  1.01    610.2±0.91ns        ? ?/sec     1.00    604.3±1.45ns        ? ?/sec
cast decimal128 to decimal128 512 lower precision                  1.00      5.1±0.02µs        ? ?/sec     1.01      5.1±0.03µs        ? ?/sec
cast decimal128 to decimal128 512 with lower scale (infallible)    1.01      6.9±0.03µs        ? ?/sec     1.00      6.8±0.02µs        ? ?/sec
cast decimal128 to decimal128 512 with same scale                  1.07     81.1±0.12ns        ? ?/sec     1.00     75.8±0.16ns        ? ?/sec
cast decimal128 to decimal256 512                                  1.00      2.3±0.01µs        ? ?/sec     1.01      2.3±0.02µs        ? ?/sec
cast decimal256 to decimal128 512                                  1.00     48.6±0.14µs        ? ?/sec     1.00     48.5±0.34µs        ? ?/sec
cast decimal256 to decimal256 512                                  1.00     11.1±0.03µs        ? ?/sec     1.02     11.3±0.03µs        ? ?/sec
cast decimal256 to decimal256 512 with same scale                  1.09     82.2±0.20ns        ? ?/sec     1.00     75.7±0.20ns        ? ?/sec
cast decimal32 to decimal32 512                                    1.00      2.3±0.01µs        ? ?/sec     1.14      2.6±0.01µs        ? ?/sec
cast decimal32 to decimal32 512 lower precision                    1.00      2.8±0.03µs        ? ?/sec     1.02      2.8±0.01µs        ? ?/sec
cast decimal32 to decimal64 512                                    1.10    347.3±3.82ns        ? ?/sec     1.00    316.7±0.93ns        ? ?/sec
cast decimal64 to decimal32 512                                    1.01      3.0±0.01µs        ? ?/sec     1.00      3.0±0.01µs        ? ?/sec
cast decimal64 to decimal64 512                                    1.01    381.5±2.23ns        ? ?/sec     1.00    376.8±0.63ns        ? ?/sec
cast dict to string view                                           1.00     52.3±0.21µs        ? ?/sec     1.03     53.8±0.12µs        ? ?/sec
cast f32 to string 512                                             1.06     19.1±0.68µs        ? ?/sec     1.00     18.1±0.04µs        ? ?/sec
cast f64 to string 512                                             1.00     21.8±0.05µs        ? ?/sec     1.04     22.6±0.07µs        ? ?/sec
cast float32 to int32 512                                          1.00   1564.8±3.69ns        ? ?/sec     1.00   1560.3±4.08ns        ? ?/sec
cast float64 to float32 512                                        1.01   1088.4±3.42ns        ? ?/sec     1.00   1077.6±5.59ns        ? ?/sec
cast float64 to uint64 512                                         1.01   1769.7±5.39ns        ? ?/sec     1.00   1754.0±2.46ns        ? ?/sec
cast i64 to string 512                                             1.02     14.7±0.12µs        ? ?/sec     1.00     14.4±0.04µs        ? ?/sec
cast int32 to float32 512                                          1.02   1065.8±2.83ns        ? ?/sec     1.00   1047.8±4.28ns        ? ?/sec
cast int32 to float64 512                                          1.01   1071.1±4.97ns        ? ?/sec     1.00   1056.6±2.00ns        ? ?/sec
cast int32 to int32 512                                            1.01    201.1±1.01ns        ? ?/sec     1.00    198.7±0.45ns        ? ?/sec
cast int32 to int64 512                                            1.00   1084.5±1.52ns        ? ?/sec     1.08   1167.5±4.21ns        ? ?/sec
cast int32 to uint32 512                                           1.03   1517.6±5.18ns        ? ?/sec     1.00   1466.4±3.60ns        ? ?/sec
cast int64 to int32 512                                            1.00   1562.5±2.39ns        ? ?/sec     1.08   1684.9±3.44ns        ? ?/sec
cast no runs of int32s to ree<int32>                               18.65  1452.8±3.65µs        ? ?/sec     1.00     77.9±0.40µs        ? ?/sec
cast runs of 10 string to ree<int32>                               83.20  1357.7±3.84µs        ? ?/sec     1.00     16.3±0.08µs        ? ?/sec
cast runs of 1000 int32s to ree<int32>                             164.19  1339.8±2.22µs        ? ?/sec    1.00      8.2±0.04µs        ? ?/sec
cast string single run to ree<int32>                               57.34  1568.5±2.43µs        ? ?/sec     1.00     27.4±0.08µs        ? ?/sec
cast string to binary view 512                                     1.00      3.2±0.01µs        ? ?/sec     1.02      3.3±0.01µs        ? ?/sec
cast string view to binary view                                    1.00     97.8±0.20ns        ? ?/sec     1.00     97.3±0.20ns        ? ?/sec
cast string view to dict                                           1.00    173.5±0.35µs        ? ?/sec     1.04    180.1±0.30µs        ? ?/sec
cast string view to string                                         1.00     48.2±0.11µs        ? ?/sec     1.02     49.1±0.52µs        ? ?/sec
cast string view to wide string                                    1.00     48.4±0.16µs        ? ?/sec     1.07     51.8±0.22µs        ? ?/sec
cast time32s to time32ms 512                                       1.01    288.3±1.04ns        ? ?/sec     1.00    285.8±0.40ns        ? ?/sec
cast time32s to time64us 512                                       1.01    292.1±0.30ns        ? ?/sec     1.00    290.6±0.86ns        ? ?/sec
cast time64ns to time32s 512                                       1.00    503.3±4.33ns        ? ?/sec     1.01    507.9±0.76ns        ? ?/sec
cast timestamp_ms to i64 512                                       1.05    453.0±2.06ns        ? ?/sec     1.00    433.1±1.14ns        ? ?/sec
cast timestamp_ms to timestamp_ns 512                              1.06      2.3±0.02µs        ? ?/sec     1.00      2.2±0.00µs        ? ?/sec
cast timestamp_ns to timestamp_s 512                               1.01    200.6±1.05ns        ? ?/sec     1.00    197.7±0.35ns        ? ?/sec
cast utf8 to date32 512                                            1.00     11.4±0.02µs        ? ?/sec     1.00     11.4±0.04µs        ? ?/sec
cast utf8 to date64 512                                            1.08     46.3±0.08µs        ? ?/sec     1.00     42.8±0.16µs        ? ?/sec
cast utf8 to f32                                                   1.00     11.5±0.09µs        ? ?/sec     1.01     11.7±0.03µs        ? ?/sec
cast wide string to binary view 512                                1.00      5.6±0.01µs        ? ?/sec     1.00      5.6±0.01µs        ? ?/sec

alamb · 2025-10-27T19:48:03Z

cast no runs of int32s to ree<int32>      18.65  1452.8±3.65µs  ? ?/sec  1.00  77.9±0.40µs  ? ?/sec
cast runs of 10 string to ree<int32>      83.20  1357.7±3.84µs  ? ?/sec  1.00  16.3±0.08µs  ? ?/sec
cast runs of 1000 int32s to ree<int32>   164.19  1339.8±2.22µs  ? ?/sec  1.00   8.2±0.04µs  ? ?/sec
cast string single run to ree<int32>      57.34  1568.5±2.43µs  ? ?/sec  1.00  27.4±0.08µs  ? ?/sec

As @vegarsti predicted, this PR appears to be quite a bit slower than using partition
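
As a reading aid for these rows: the leading number in each column appears to be that column's time divided by the fastest time in the row, so the faster side shows 1.00. A small check under that assumption, using the rounded times printed above. The names `relative`, `REE_ROWS`, and `ree_slowdowns` are made up for this sketch, and the results only approximately match the reported ratios, which were presumably computed from unrounded measurements:

```rust
/// Ratio of a measurement to the fastest measurement, both in the same unit.
fn relative(time_us: f64, fastest_us: f64) -> f64 {
    time_us / fastest_us
}

/// (benchmark name, PR time in µs, main time in µs), copied from the rows above.
const REE_ROWS: [(&str, f64, f64); 4] = [
    ("cast no runs of int32s to ree<int32>", 1452.8, 77.9),
    ("cast runs of 10 string to ree<int32>", 1357.7, 16.3),
    ("cast runs of 1000 int32s to ree<int32>", 1339.8, 8.2),
    ("cast string single run to ree<int32>", 1568.5, 27.4),
];

/// Slowdown of the PR branch relative to main for each REE benchmark.
fn ree_slowdowns() -> Vec<(&'static str, f64)> {
    REE_ROWS
        .iter()
        .map(|&(name, pr_us, main_us)| (name, relative(pr_us, main_us)))
        .collect()
}
```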

Weijun-H · 2025-10-28T08:22:26Z

FYI @vegarsti, @alamb: after several rounds of optimization, the current version delivers significant improvements over the previous one.

- Type-specialized dispatch: compute_run_boundaries now routes each physical layout (boolean, primitive scalars, binary/string, etc.) to a dedicated helper, allowing most arrays to bypass the slow, generic ArrayData comparison path.
- Chunked primitive scanning: the no-null primitive path uses scan_run_end, which compares 16 bytes at a time via u128 loads. When a chunk differs, it falls back to scalar iteration, which reduces branches and bounds checks in the hot loop.
- Targeted use of unsafe for performance: tight loops leverage get_unchecked, from_raw_parts, and read_unaligned to eliminate redundant bounds and alignment checks. Each unsafe block includes detailed safety comments describing the invariants upheld.
- Generic fallback: less common types still rely on ArrayData equality but reuse the shared accumulator to produce consistent run and value outputs, without special-casing memory management.
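
A minimal safe sketch of the chunked scan described above, assuming `i32` values with no nulls. The name `scan_run_end` comes from this comment; the signature, the `pack4` helper, and the byte-packing approach are assumptions for illustration (the PR is described as using unaligned `u128` loads in `unsafe` code instead):

```rust
/// Hypothetical helper: packs the first four `i32`s of `chunk` into one `u128`
/// so that four elements can be compared with a single integer comparison.
fn pack4(chunk: &[i32]) -> u128 {
    let mut bytes = [0u8; 16];
    for (dst, v) in bytes.chunks_exact_mut(4).zip(chunk) {
        dst.copy_from_slice(&v.to_ne_bytes());
    }
    u128::from_ne_bytes(bytes)
}

/// Returns the exclusive end index of the run that starts at `start`.
/// Panics if `start` is out of bounds.
fn scan_run_end(values: &[i32], start: usize) -> usize {
    let target = values[start];
    // Four copies of the run value, packed the same way as the data chunks.
    let pattern = pack4(&[target; 4]);
    let mut idx = start + 1;
    // Fast path: compare 16 bytes (four values) per iteration.
    while idx + 4 <= values.len() {
        if pack4(&values[idx..idx + 4]) != pattern {
            break;
        }
        idx += 4;
    }
    // Scalar fallback: locate the exact end inside a differing chunk or the tail.
    while idx < values.len() && values[idx] == target {
        idx += 1;
    }
    idx
}
```

Calling `scan_run_end` repeatedly, each time starting from the previous result, walks the array one run at a time.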

cast string single run to ree<int32>
                        time:   [23.143 µs 23.180 µs 23.224 µs]
                        change: [−8.5926% −6.6138% −5.2622%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  9 (9.00%) high severe

cast runs of 10 string to ree<int32>
                        time:   [4.4857 µs 4.4924 µs 4.4999 µs]
                        change: [−35.582% −32.807% −30.598%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

cast runs of 1000 int32s to ree<int32>
                        time:   [1.9651 µs 1.9923 µs 2.0449 µs]
                        change: [−35.958% −34.582% −33.095%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

cast no runs of int32s to ree<int32>
                        time:   [27.745 µs 28.013 µs 28.291 µs]
                        change: [−27.957% −27.305% −26.645%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  14 (14.00%) high mild

alamb · 2025-10-28T19:33:30Z

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1017-gcp #18~24.04.1-Ubuntu SMP Tue Sep 23 17:51:44 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing 8708-remove-arrow-ord-in-arrow-cast (f9fc4fe) to 6c3e588 diff
BENCH_NAME=cast_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench cast_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=8708-remove-arrow-ord-in-arrow-cast
Results will be posted here when complete

alamb · 2025-10-28T19:51:10Z

🤖: Benchmark completed

Details

group                                                              8708-remove-arrow-ord-in-arrow-cast    main
-----                                                              -----------------------------------    ----
cast binary view to string                                         1.00     68.8±0.29µs        ? ?/sec    1.07     73.3±0.29µs        ? ?/sec
cast binary view to string view                                    1.24    115.9±0.41µs        ? ?/sec    1.00     93.4±0.16µs        ? ?/sec
cast binary view to wide string                                    1.14     73.8±0.32µs        ? ?/sec    1.00     64.9±0.27µs        ? ?/sec
cast date32 to date64 512                                          1.00    295.8±1.07ns        ? ?/sec    1.00    296.2±0.51ns        ? ?/sec
cast date64 to date32 512                                          1.03    512.0±4.08ns        ? ?/sec    1.00    499.2±0.86ns        ? ?/sec
cast decimal128 to decimal128 512                                  1.01    609.4±2.59ns        ? ?/sec    1.00    605.9±3.07ns        ? ?/sec
cast decimal128 to decimal128 512 lower precision                  1.02      5.2±0.03µs        ? ?/sec    1.00      5.1±0.01µs        ? ?/sec
cast decimal128 to decimal128 512 with lower scale (infallible)    1.00      6.8±0.09µs        ? ?/sec    1.00      6.8±0.01µs        ? ?/sec
cast decimal128 to decimal128 512 with same scale                  1.00     75.6±0.15ns        ? ?/sec    1.00     76.0±0.08ns        ? ?/sec
cast decimal128 to decimal256 512                                  1.01      2.3±0.00µs        ? ?/sec    1.00      2.3±0.01µs        ? ?/sec
cast decimal256 to decimal128 512                                  1.00     48.5±0.18µs        ? ?/sec    1.00     48.3±0.06µs        ? ?/sec
cast decimal256 to decimal256 512                                  1.00     11.5±0.08µs        ? ?/sec    1.05     12.1±0.08µs        ? ?/sec
cast decimal256 to decimal256 512 with same scale                  1.00     75.7±0.11ns        ? ?/sec    1.00     75.5±0.13ns        ? ?/sec
cast decimal32 to decimal32 512                                    1.00      2.3±0.03µs        ? ?/sec    1.13      2.6±0.02µs        ? ?/sec
cast decimal32 to decimal32 512 lower precision                    1.00      2.8±0.03µs        ? ?/sec    1.01      2.8±0.02µs        ? ?/sec
cast decimal32 to decimal64 512                                    1.10    347.8±2.21ns        ? ?/sec    1.00    316.6±0.56ns        ? ?/sec
cast decimal64 to decimal32 512                                    1.01      3.0±0.02µs        ? ?/sec    1.00      3.0±0.00µs        ? ?/sec
cast decimal64 to decimal64 512                                    1.01    381.6±5.70ns        ? ?/sec    1.00    376.0±0.49ns        ? ?/sec
cast dict to string view                                           1.00     52.5±0.10µs        ? ?/sec    1.02     53.8±0.09µs        ? ?/sec
cast f32 to string 512                                             1.03     18.7±0.04µs        ? ?/sec    1.00     18.3±0.04µs        ? ?/sec
cast f64 to string 512                                             1.00     21.4±0.12µs        ? ?/sec    1.06     22.6±0.12µs        ? ?/sec
cast float32 to int32 512                                          1.01   1577.3±2.44ns        ? ?/sec    1.00   1567.4±1.85ns        ? ?/sec
cast float64 to float32 512                                        1.02   1110.7±3.13ns        ? ?/sec    1.00   1091.8±1.88ns        ? ?/sec
cast float64 to uint64 512                                         1.02   1773.8±1.71ns        ? ?/sec    1.00   1742.5±3.28ns        ? ?/sec
cast i64 to string 512                                             1.00     14.4±0.04µs        ? ?/sec    1.02     14.7±0.13µs        ? ?/sec
cast int32 to float32 512                                          1.00   1015.5±1.26ns        ? ?/sec    1.04   1054.3±2.03ns        ? ?/sec
cast int32 to float64 512                                          1.03   1088.9±3.95ns        ? ?/sec    1.00   1053.8±1.81ns        ? ?/sec
cast int32 to int32 512                                            1.12    223.6±0.53ns        ? ?/sec    1.00    198.8±0.20ns        ? ?/sec
cast int32 to int64 512                                            1.00   1096.9±0.99ns        ? ?/sec    1.06   1167.2±2.57ns        ? ?/sec
cast int32 to uint32 512                                           1.05   1531.4±4.62ns        ? ?/sec    1.00   1464.9±1.52ns        ? ?/sec
cast int64 to int32 512                                            1.00  1568.7±34.86ns        ? ?/sec    1.08   1688.6±2.08ns        ? ?/sec
cast no runs of int32s to ree<int32>                               1.00     56.3±0.10µs        ? ?/sec    1.36     76.7±0.18µs        ? ?/sec
cast runs of 10 string to ree<int32>                               1.00      9.3±0.02µs        ? ?/sec    1.72     16.0±0.07µs        ? ?/sec
cast runs of 1000 int32s to ree<int32>                             1.00      3.9±0.01µs        ? ?/sec    2.13      8.2±0.02µs        ? ?/sec
cast string single run to ree<int32>                               1.00     23.8±0.08µs        ? ?/sec    1.15     27.4±0.08µs        ? ?/sec
cast string to binary view 512                                     1.00      3.3±0.00µs        ? ?/sec    1.00      3.3±0.00µs        ? ?/sec
cast string view to binary view                                    1.00     96.4±0.12ns        ? ?/sec    1.02     98.1±0.19ns        ? ?/sec
cast string view to dict                                           1.02    175.6±0.39µs        ? ?/sec    1.00    171.5±0.36µs        ? ?/sec
cast string view to string                                         1.00     48.4±0.10µs        ? ?/sec    1.01     48.9±0.08µs        ? ?/sec
cast string view to wide string                                    1.00     49.8±0.27µs        ? ?/sec    1.04     51.7±0.15µs        ? ?/sec
cast time32s to time32ms 512                                       1.02    290.8±0.91ns        ? ?/sec    1.00    285.2±0.44ns        ? ?/sec
cast time32s to time64us 512                                       1.04    302.1±0.60ns        ? ?/sec    1.00    289.8±0.55ns        ? ?/sec
cast time64ns to time32s 512                                       1.01    508.7±1.38ns        ? ?/sec    1.00    501.2±5.97ns        ? ?/sec
cast timestamp_ms to i64 512                                       1.00    434.9±1.91ns        ? ?/sec    1.01    440.4±5.99ns        ? ?/sec
cast timestamp_ms to timestamp_ns 512                              1.06      2.3±0.00µs        ? ?/sec    1.00      2.2±0.01µs        ? ?/sec
cast timestamp_ns to timestamp_s 512                               1.13    224.0±0.58ns        ? ?/sec    1.00    197.9±0.57ns        ? ?/sec
cast utf8 to date32 512                                            1.00     11.4±0.12µs        ? ?/sec    1.00     11.3±0.02µs        ? ?/sec
cast utf8 to date64 512                                            1.00     42.6±0.44µs        ? ?/sec    1.00     42.8±0.26µs        ? ?/sec
cast utf8 to f32                                                   1.01     11.8±0.03µs        ? ?/sec    1.00     11.7±0.06µs        ? ?/sec
cast wide string to binary view 512                                1.02      5.7±0.01µs        ? ?/sec    1.00      5.6±0.01µs        ? ?/sec

Weijun-H changed the title ~~refactor: remove dependency on arrow_ord~~ refactor: remove arrow-ord dependency in arrow-cast Oct 27, 2025

github-actions bot added the arrow Changes to the arrow crate label Oct 27, 2025

Weijun-H force-pushed the 8708-remove-arrow-ord-in-arrow-cast branch from a6198b8 to c72af36 Compare October 27, 2025 17:02

alamb reviewed Oct 27, 2025

refactor: remove dependency on arrow_ord

26c8dc6

vegarsti mentioned this pull request Oct 27, 2025

perf: Use Vec::with_capacity in cast_to_run_end_encoded #8726

Open

chore

fe208be

Weijun-H force-pushed the 8708-remove-arrow-ord-in-arrow-cast branch from 70b24d1 to fe208be Compare October 27, 2025 23:00

chore: Added comments

f9fc4fe
