
Conversation

@Shnatsel (Collaborator) commented Nov 30, 2025

  • Extend this to f32
  • Multi-thread the part that spans the two halves (doesn't seem to be beneficial; difficult to measure due to noise)
  • Feature-gate the rayon dependency (see the sketch after this list)
  • Do something about the regression for smaller sizes in the planner
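
For reference, a minimal sketch of what feature-gating rayon typically looks like in a Rust crate. The doubling workload is made up for illustration; this is not PhastFT's actual code, only the general Cargo pattern:

```rust
// Cargo.toml:
//   [dependencies]
//   rayon = { version = "1", optional = true }
//
// An optional dependency automatically exposes a `rayon` feature.

/// Made-up workload: doubles every element. With the `rayon`
/// feature enabled this runs on rayon's thread pool; without it,
/// rayon is not compiled in at all.
pub fn double_all(data: &mut [f64]) {
    #[cfg(feature = "rayon")]
    {
        use rayon::prelude::*;
        data.par_iter_mut().for_each(|x| *x *= 2.0);
    }
    #[cfg(not(feature = "rayon"))]
    for x in data.iter_mut() {
        *x *= 2.0;
    }
}
```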

@Shnatsel
Copy link
Collaborator Author

The initial numbers are promising: over 3x faster DIT at size 16777216. The gains for smaller sizes are more modest, and the smallest sizes that use this codepath regress.
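
One plausible shape for a planner fallback that would avoid that small-size regression (a sketch only: the `PARALLEL_CUTOFF` constant, its value, and the square-root workload are assumptions, not PhastFT's planner logic):

```rust
use rayon::prelude::*;

/// Hypothetical size cutoff below which the serial path wins.
/// 2^17 matches the 131072 break-even point reported later in this
/// thread, but the real value would have to come from benchmarks.
const PARALLEL_CUTOFF: usize = 1 << 17;

/// Made-up workload standing in for one transform pass.
fn process(buf: &mut [f64]) {
    if buf.len() >= PARALLEL_CUTOFF {
        // Large inputs: thread-pool overhead is amortized.
        buf.par_iter_mut().for_each(|x| *x = x.sqrt());
    } else {
        // Small inputs: per-task overhead dominates; stay serial.
        for x in buf.iter_mut() {
            *x = x.sqrt();
        }
    }
}
```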

@codecov-commenter commented Nov 30, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.82%. Comparing base (e3d75e5) to head (2d4130e).

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #50      +/-   ##
==========================================
+ Coverage   99.29%   99.82%   +0.52%     
==========================================
  Files          12       13       +1     
  Lines        2277     2281       +4     
==========================================
+ Hits         2261     2277      +16     
+ Misses         16        4      -12     


@Shnatsel (Collaborator, Author)

With the rayon thread pool also being used for COBRA instead of spawning brand-new threads, sizes 131072 and above are 2x faster compared to main, and our largest size, 16777216, is 2.5x faster. Admittedly, my CPU has a lot of threads, so the gains may be less pronounced on lower core counts.
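
Roughly, the difference between the two approaches looks like this (a sketch with a made-up doubling workload, not the actual COBRA kernel): `std::thread::scope` spawns and joins fresh OS threads on every call, while `rayon::join` reuses a persistent worker pool:

```rust
use std::thread;

// Spawning fresh OS threads on every call: each invocation pays
// thread creation and teardown costs.
fn halves_with_std(left: &mut [f64], right: &mut [f64]) {
    thread::scope(|s| {
        s.spawn(|| left.iter_mut().for_each(|x| *x *= 2.0));
        s.spawn(|| right.iter_mut().for_each(|x| *x *= 2.0));
    });
}

// Reusing rayon's persistent worker pool: no per-call thread spawns.
fn halves_with_rayon(left: &mut [f64], right: &mut [f64]) {
    rayon::join(
        || left.iter_mut().for_each(|x| *x *= 2.0),
        || right.iter_mut().for_each(|x| *x *= 2.0),
    );
}
```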

@smu160 I'd appreciate benchmarks on ARM.

@Shnatsel (Collaborator, Author) commented Dec 1, 2025

So, fun fact: disabling the rayon feature, and thereby statically knowing that multi-threading will not be used for COBRA, improves benchmarks for the smallest size considerably:

Forward f32/PhastFT DIT/64
                        time:   [156.70 ns 161.32 ns 164.25 ns]
                        thrpt:  [389.66 Melem/s 396.73 Melem/s 408.42 Melem/s]
                        thrpt:  [1.4516 GiB/s 1.4779 GiB/s 1.5215 GiB/s]
                 change:
                        time:   [−29.253% −26.600% −23.937%] (p = 0.00 < 0.05)
                        thrpt:  [+31.469% +36.240% +41.348%]
                        Performance has improved.

Not sure if it's a code-layout artifact, the result of more inlining, or something else. We never had the chance to see these gains before because the std::thread::scope codepath was compiled unconditionally.
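
A sketch of what the static knowledge buys (hypothetical shape and workload, not PhastFT's actual code): once the cfg is resolved at compile time, the serial branch is the only code the optimizer sees, so it can inline freely instead of carrying a runtime size check plus a cold parallel path:

```rust
/// Placeholder for the serial codepath.
fn work_serial(data: &mut [f64]) {
    data.iter_mut().for_each(|x| *x *= 2.0);
}

#[cfg(feature = "rayon")]
fn work(data: &mut [f64]) {
    // Runtime choice: the optimizer has to keep both codepaths alive.
    let mid = data.len() / 2;
    if data.len() >= 1 << 17 {
        let (a, b) = data.split_at_mut(mid);
        rayon::join(|| work_serial(a), || work_serial(b));
    } else {
        work_serial(data);
    }
}

#[cfg(not(feature = "rayon"))]
#[inline]
fn work(data: &mut [f64]) {
    // With rayon compiled out this is the entire function: no branch,
    // no cold parallel code nearby, and it inlines freely into callers.
    work_serial(data);
}
```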

I'm not going to do anything about it just yet, but it is something to keep in mind for the future.

Commit: "…hmarks show that it regresses small sizes far less, but not enough to break even with the single-threaded implementation; but also regresses large sizes a lot. So it loses out to both rayon and single-threaded depending on the size and doesn't seem to be worth it."

Revert commit (message quotes the above): "This reverts commit d703fe3."
@Shnatsel changed the title from "PoC: multithreading via Rayon" to "multithreading via Rayon" on Dec 2, 2025
@Shnatsel marked this pull request as ready for review on December 2, 2025 12:42
@smu160 (Member) commented Dec 3, 2025

Results on Apple M2:

[Four benchmark screenshots, taken Dec 2, 2025]

@Shnatsel merged commit 986d403 into main on Dec 3, 2025 (10 checks passed)
@Shnatsel deleted the rayon branch on December 3, 2025 09:40
