Optimising cobra_apply
#47
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##             main      #47      +/-   ##
==========================================
+ Coverage   99.16%   99.26%   +0.09%
==========================================
  Files          12       12
  Lines        2167     2165       -2
==========================================
  Hits         2149     2149
+ Misses         18       16       -2
On my Zen 4 CPU this is a consistent regression in the default configuration (cargo bench --bench=bit_reversal) and makes little difference with RUSTFLAGS='-C target-cpu=native' cargo bench --bench=bit_reversal. On what hardware did you measure it?
Interesting! CPU is AMD Ryzen™ 5 5625U with Radeon™ Graphics × 12. That said, I didn't set the target-cpu...
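For anyone reproducing these numbers: -C target-cpu=native lets the compiler assume every instruction-set extension of the build machine's CPU, whereas a plain x86-64 build targets a conservative baseline that does not include AVX or AVX2. The minimal sketch below (function names are hypothetical, and the feature list is an arbitrary selection) reports which features a build was compiled with versus which the running CPU supports, which makes it easier to tell which configuration a given result came from.

```rust
/// Features the compiler was allowed to assume for this build (fixed at compile time).
/// A plain `cargo bench` on x86-64 usually reports none of these;
/// with `-C target-cpu=native` the list reflects the build machine's CPU.
pub fn compiled_features() -> Vec<&'static str> {
    let mut enabled = Vec::new();
    if cfg!(target_feature = "avx") { enabled.push("avx"); }
    if cfg!(target_feature = "avx2") { enabled.push("avx2"); }
    if cfg!(target_feature = "fma") { enabled.push("fma"); }
    if cfg!(target_feature = "bmi2") { enabled.push("bmi2"); }
    enabled
}

/// Features the CPU running the binary actually supports (checked at run time).
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn runtime_features() -> Vec<&'static str> {
    let mut found = Vec::new();
    if is_x86_feature_detected!("avx") { found.push("avx"); }
    if is_x86_feature_detected!("avx2") { found.push("avx2"); }
    if is_x86_feature_detected!("fma") { found.push("fma"); }
    if is_x86_feature_detected!("bmi2") { found.push("bmi2"); }
    found
}

/// On non-x86 targets none of the features above apply.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
pub fn runtime_features() -> Vec<&'static str> {
    Vec::new()
}
```

Printing both lists next to a benchmark run records whether the numbers came from a baseline or a native build.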
Hmm. Rust version? Mine is rustc 1.91.1 (ed61e7d7e 2025-11-07)
Rust 1.91.0.

The +/- percentages might be a bit messed up because of the order in which I ran the benches, but the absolute numbers show a stark benefit of the LUT!

With LUT, target-cpu=native
Benchmarking cobra_apply/cobra/15: Collecting 100 samples in estimated 5.1372 s (81k i
cobra_apply/cobra/15 time: [64.174 µs 64.578 µs 65.008 µs]
change: [−21.340% −20.823% −20.346%] (p = 0.00 < 0.05)
Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
Benchmarking cobra_apply/cobra/16: Collecting 100 samples in estimated 5.3958 s (45k i
cobra_apply/cobra/16 time: [118.73 µs 119.12 µs 119.51 µs]
change: [−24.808% −24.492% −24.191%] (p = 0.00 < 0.05)
Performance has improved.
Benchmarking cobra_apply/cobra/17: Collecting 100 samples in estimated 6.0103 s (25k i
cobra_apply/cobra/17 time: [237.44 µs 237.76 µs 238.10 µs]
change: [−25.357% −24.942% −24.548%] (p = 0.00 < 0.05)
Performance has improved.
Benchmarking cobra_apply/cobra/18: Collecting 100 samples in estimated 6.9839 s (15k i
cobra_apply/cobra/18 time: [465.76 µs 468.96 µs 472.81 µs]
change: [−27.233% −26.014% −25.090%] (p = 0.00 < 0.05)
Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
2 (2.00%) high mild
1 (1.00%) high severe
Benchmarking cobra_apply/cobra/19: Collecting 100 samples in estimated 9.8303 s (10k i
cobra_apply/cobra/19 time: [953.42 µs 957.66 µs 962.12 µs]
change: [−23.881% −23.161% −22.503%] (p = 0.00 < 0.05)
Performance has improved.
Without LUT, target-cpu=native
Benchmarking cobra_apply/cobra/15: Collecting 100 samples in estimated 5.2045 s (50k i
cobra_apply/cobra/15 time: [103.51 µs 103.85 µs 104.23 µs]
change: [+60.409% +61.184% +62.011%] (p = 0.00 < 0.05)
Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
2 (2.00%) high mild
1 (1.00%) high severe
Benchmarking cobra_apply/cobra/16: Collecting 100 samples in estimated 5.2254 s (25k i
cobra_apply/cobra/16 time: [201.28 µs 201.65 µs 202.05 µs]
change: [+68.858% +69.429% +70.026%] (p = 0.00 < 0.05)
Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
Benchmarking cobra_apply/cobra/17: Collecting 100 samples in estimated 6.2127 s (15k i
cobra_apply/cobra/17 time: [408.82 µs 409.32 µs 409.87 µs]
change: [+71.743% +72.192% +72.625%] (p = 0.00 < 0.05)
Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
4 (4.00%) high mild
2 (2.00%) high severe
Benchmarking cobra_apply/cobra/18: Collecting 100 samples in estimated 8.3469 s (10k i
cobra_apply/cobra/18 time: [822.37 µs 823.99 µs 825.80 µs]
change: [+77.138% +77.935% +78.674%] (p = 0.00 < 0.05)
Performance has regressed.
Benchmarking cobra_apply/cobra/19: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.3s, enable flat sampling, or reduce sample count to 50.
Benchmarking cobra_apply/cobra/19: Collecting 100 samples in estimated 8.3439 s (5050
cobra_apply/cobra/19 time: [1.6567 ms 1.6615 ms 1.6663 ms]
change: [+71.948% +72.926% +73.809%] (p = 0.00 < 0.05)
Performance has regressed.
With LUT, no RUSTFLAGS
cobra_apply/cobra/15 time: [76.037 µs 76.244 µs 76.491 µs]
change: [−28.284% −27.892% −27.445%] (p = 0.00 < 0.05)
Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
4 (4.00%) high mild
3 (3.00%) high severe
Benchmarking cobra_apply/cobra/16: Collecting 100 samples in estimated 5.3447 s (35k i
cobra_apply/cobra/16 time: [150.24 µs 150.63 µs 151.06 µs]
change: [−29.339% −28.939% −28.615%] (p = 0.00 < 0.05)
Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
Benchmarking cobra_apply/cobra/17: Collecting 100 samples in estimated 6.1029 s (20k i
cobra_apply/cobra/17 time: [304.55 µs 306.17 µs 307.95 µs]
change: [−29.238% −28.828% −28.363%] (p = 0.00 < 0.05)
Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
Benchmarking cobra_apply/cobra/18: Collecting 100 samples in estimated 6.0233 s (10k i
cobra_apply/cobra/18 time: [593.32 µs 595.29 µs 597.30 µs]
change: [−31.614% −31.149% −30.778%] (p = 0.00 < 0.05)
Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) high mild
1 (1.00%) high severe
Benchmarking cobra_apply/cobra/19: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.0s, enable flat sampling, or reduce sample count to 60.
Benchmarking cobra_apply/cobra/19: Collecting 100 samples in estimated 6.0096 s (5050
cobra_apply/cobra/19 time: [1.1736 ms 1.1766 ms 1.1799 ms]
change: [−31.786% −31.555% −31.313%] (p = 0.00 < 0.05)
Performance has improved.
Without LUT, no RUSTFLAGS
Benchmarking cobra_apply/cobra/15: Collecting 100 samples in estimated 5.3685 s (50k i
cobra_apply/cobra/15 time: [106.26 µs 106.56 µs 106.90 µs]
change: [+2.3883% +2.8431% +3.2761%] (p = 0.00 < 0.05)
Performance has regressed.
Benchmarking cobra_apply/cobra/16: Collecting 100 samples in estimated 5.3756 s (25k i
cobra_apply/cobra/16 time: [211.63 µs 212.11 µs 212.79 µs]
change: [+4.9958% +5.4688% +6.1208%] (p = 0.00 < 0.05)
Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high severe
Benchmarking cobra_apply/cobra/17: Collecting 100 samples in estimated 6.4979 s (15k i
cobra_apply/cobra/17 time: [428.96 µs 429.87 µs 430.97 µs]
change: [+4.9453% +5.3419% +5.7541%] (p = 0.00 < 0.05)
Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
Benchmarking cobra_apply/cobra/18: Collecting 100 samples in estimated 8.9149 s (10k i
cobra_apply/cobra/18 time: [863.06 µs 868.59 µs 876.94 µs]
change: [+4.2325% +4.7191% +5.3150%] (p = 0.00 < 0.05)
Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) high mild
1 (1.00%) high severe
Benchmarking cobra_apply/cobra/19: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.7s, enable flat sampling, or reduce sample count to 50.
Benchmarking cobra_apply/cobra/19: Collecting 100 samples in estimated 8.7387 s (5050
cobra_apply/cobra/19 time: [1.7152 ms 1.7169 ms 1.7188 ms]
change: [+3.1731% +3.4948% +3.8108%] (p = 0.00 < 0.05)
Performance has regressed.
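The LUT code from this PR is not shown in the thread. As a hedged, simplified illustration of what a lookup-table bit reversal generally looks like (not the PR's actual cobra_apply change), the sketch below reverses an index with a 256-entry byte table and uses it for a plain out-of-place bit-reversal permutation. REV8, reverse_bits_lut and bit_reverse_permute are hypothetical names, and the sketch deliberately leaves out COBRA's cache blocking.

```rust
/// Hypothetical 256-entry table: REV8[b] is the byte `b` with its 8 bits reversed.
/// Built at compile time by reversing each byte one bit at a time.
static REV8: [u8; 256] = {
    let mut table = [0u8; 256];
    let mut i = 0;
    while i < 256 {
        let mut x = i as u8;
        let mut r = 0u8;
        let mut k = 0;
        while k < 8 {
            r = (r << 1) | (x & 1);
            x >>= 1;
            k += 1;
        }
        table[i] = r;
        i += 1;
    }
    table
};

/// Reverse the low `bits` bits of `i` (1 <= bits <= 32) with four table lookups.
/// Reversing all 32 bits = reversing each byte and swapping byte order.
fn reverse_bits_lut(i: u32, bits: u32) -> u32 {
    assert!((1..=32).contains(&bits));
    let b = i.to_le_bytes();
    let full = u32::from_be_bytes([
        REV8[b[0] as usize],
        REV8[b[1] as usize],
        REV8[b[2] as usize],
        REV8[b[3] as usize],
    ]);
    // Keep only the top `bits` bits, i.e. the reversed low `bits` bits of `i`.
    full >> (32 - bits)
}

/// Out-of-place bit-reversal permutation: out[rev(i)] = data[i].
fn bit_reverse_permute<T: Copy>(data: &[T], out: &mut [T]) {
    let n = data.len();
    assert!(n.is_power_of_two() && n >= 2 && out.len() == n);
    let bits = n.trailing_zeros();
    for (i, &x) in data.iter().enumerate() {
        out[reverse_bits_lut(i as u32, bits) as usize] = x;
    }
}
```

The standard library's u32::reverse_bits is a convenient reference for checking such a table; which of the two is faster is exactly the kind of hardware-dependent question this thread is about.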
Tip (Shnatsel, Sat, 15 Nov 2025, 16:09): to make percentages make sense, you can run

cargo bench --bench=bit_reversal -- --save-baseline=main

followed by

cargo bench --bench=bit_reversal -- --baseline=main

and it will calculate percentages relative to the baseline saved by the first command.

Ah, that's useful. Criterion is one of those tools I severely underuse. If I ever read the docs, I could probably figure out how to group the with- and without-LUT variants into a single benchmark for much easier direct comparison.
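The saved-baseline workflow is the reliable way to get these percentages. As a quick cross-check of the runs above, the LUT's effect can also be computed by hand from the absolute point estimates (the middle value of each time: line) as (with_LUT / without_LUT - 1) × 100. The small helper below just does that arithmetic on four of the reported values; the names are hypothetical and it is not part of PhastFT or Criterion.

```rust
/// Relative change of `new` versus `baseline`, in percent (negative means faster).
fn percent_change(baseline: f64, new: f64) -> f64 {
    (new / baseline - 1.0) * 100.0
}

/// Point estimates in µs copied from the runs above:
/// (configuration, size exponent, without LUT, with LUT).
const RUNS: [(&str, u32, f64, f64); 4] = [
    ("target-cpu=native", 15, 103.85, 64.578),
    ("target-cpu=native", 19, 1661.5, 957.66),
    ("no RUSTFLAGS", 15, 106.56, 76.244),
    ("no RUSTFLAGS", 19, 1716.9, 1176.6),
];

/// LUT vs no-LUT change for each run, computed directly from absolute times
/// rather than from Criterion's "change" line (which compares to the previous run).
fn lut_effect() -> Vec<(&'static str, u32, f64)> {
    RUNS.iter()
        .map(|&(config, exp, without_lut, with_lut)| {
            (config, exp, percent_change(without_lut, with_lut))
        })
        .collect()
}
```

For 2^15 this gives about -37.8% with target-cpu=native and about -28.4% without RUSTFLAGS; for 2^19, about -42.4% and -31.5%. These are single runs on one machine, so they say nothing about other CPUs.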
I don't think there's a one-size-fits-all solution. If we want to reap these gains, we'll need to copy FFTW's design and measure the performance of various implementations at runtime, then select the fastest one.
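A minimal sketch of that idea, similar in spirit to FFTW's FFTW_MEASURE planning mode: time each candidate once at plan time on a scratch buffer, then reuse the winner. The Kernel signature, the two candidates (a bit-at-a-time loop and one using the standard library's usize::reverse_bits) and plan are all hypothetical; this is not FFTW's or PhastFT's planner API. A real planner would also want repeated measurements and would cache its decision per transform size.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical signature shared by all in-place bit-reversal candidates.
type Kernel = fn(&mut [f64]);

/// Candidate 1: reverse each index one bit at a time, swapping each pair once.
fn bitrev_loop(data: &mut [f64]) {
    let n = data.len();
    assert!(n.is_power_of_two());
    let bits = n.trailing_zeros();
    for i in 0..n {
        let (mut x, mut j) = (i, 0usize);
        for _ in 0..bits {
            j = (j << 1) | (x & 1);
            x >>= 1;
        }
        if i < j {
            data.swap(i, j);
        }
    }
}

/// Candidate 2: the same permutation, using the standard library's reverse_bits.
fn bitrev_intrinsic(data: &mut [f64]) {
    let n = data.len();
    assert!(n.is_power_of_two());
    let bits = n.trailing_zeros();
    if bits == 0 {
        return; // a single element is its own bit reversal
    }
    for i in 0..n {
        let j = i.reverse_bits() >> (usize::BITS - bits);
        if i < j {
            data.swap(i, j);
        }
    }
}

/// Time every candidate on a scratch buffer of length `n` and return the index
/// of the fastest. Bit reversal is its own inverse, so repeatedly applying it
/// to the same buffer is fine for timing purposes.
fn plan(candidates: &[Kernel], n: usize, reps: u32) -> usize {
    assert!(!candidates.is_empty());
    let mut scratch: Vec<f64> = (0..n).map(|i| i as f64).collect();
    let mut best = (0, Duration::MAX);
    for (idx, &kernel) in candidates.iter().enumerate() {
        kernel(black_box(scratch.as_mut_slice())); // warm-up run
        let start = Instant::now();
        for _ in 0..reps {
            kernel(black_box(scratch.as_mut_slice()));
        }
        let elapsed = start.elapsed();
        if elapsed < best.1 {
            best = (idx, elapsed);
        }
    }
    best.0
}
```

A LUT-based variant would simply be one more entry in the candidates slice.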
I've looked into COBRA some more and it's highly hardware-dependent: see #49. We really do just need to start going down the FFTW route: measure the different variants in the planner and pick the best one for the hardware we're running on. It would be great to have your LUT-based version as one of the options.
See #46
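To make "hardware-dependent" more concrete: the core COBRA idea (Carter and Gatlin's cache-optimal bit reversal) gathers 2^b × 2^b tiles of the input into a small buffer and scatters them back out, so both reads and writes touch contiguous runs of 2^b elements. The best tile size b plausibly depends on cache sizes, which is one reason a runtime planner is attractive. The sketch below is a simplified, out-of-place illustration written for this discussion (cobra and rev are hypothetical names); it is not PhastFT's cobra_apply and does no SIMD or prefetch tuning.

```rust
/// Reverse the low `bits` bits of `x` (for bits == 0 the result is 0).
fn rev(x: usize, bits: u32) -> usize {
    if bits == 0 { 0 } else { x.reverse_bits() >> (usize::BITS - bits) }
}

/// Simplified out-of-place COBRA-style bit reversal: out[rev(i)] = input[i],
/// where input.len() == 2^lg_n. `b` is log2 of the tile side (requires 2b <= lg_n);
/// the 2^b x 2^b buffer is the part that should fit in cache.
fn cobra<T: Copy + Default>(input: &[T], out: &mut [T], lg_n: u32, b: u32) {
    let n = 1usize << lg_n;
    assert!(input.len() == n && out.len() == n && 2 * b <= lg_n);
    let tile = 1usize << b;
    let mid_bits = lg_n - 2 * b;
    let mut buf = vec![T::default(); tile * tile];
    for c in 0..(1usize << mid_bits) {
        // Gather: index i = (a, c, lo) with a = top b bits, lo = low b bits.
        // Each inner loop reads a contiguous run of `tile` elements.
        for a in 0..tile {
            let ra = rev(a, b);
            for lo in 0..tile {
                buf[ra * tile + lo] = input[(a << (lg_n - b)) | (c << b) | lo];
            }
        }
        // Scatter: rev(i) = (rev(lo), rev(c), rev(a)).
        // Each inner loop writes a contiguous run of `tile` elements.
        let rc = rev(c, mid_bits);
        for lo in 0..tile {
            let rlo = rev(lo, b);
            for ra in 0..tile {
                out[(rlo << (lg_n - b)) | (rc << b) | ra] = buf[ra * tile + lo];
            }
        }
    }
}
```

With b = 0 this degenerates to the naive permutation; larger b trades buffer size for more contiguous memory traffic, and where that trade-off peaks is what a measuring planner would discover.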