
Conversation

@Shnatsel (Collaborator) commented Nov 30, 2025

  • Extend this to f32
  • Multi-thread the part that spans the two halves (doesn't seem to be beneficial; difficult to measure due to noise)
  • Feature-gate the rayon dependency (see the sketch after this list)
  • Do something about the regression for smaller sizes in the planner
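
For reference, a minimal sketch of what feature-gating rayon typically looks like in a Rust crate. The doubling workload is made up for illustration; this is not PhastFT's actual code, only the general Cargo pattern:

```rust
// Cargo.toml:
//   [dependencies]
//   rayon = { version = "1", optional = true }
//
// An optional dependency automatically exposes a `rayon` feature.

/// Made-up workload: doubles every element. With the `rayon`
/// feature enabled this runs on rayon's thread pool; without it,
/// rayon is not compiled in at all.
pub fn double_all(data: &mut [f64]) {
    #[cfg(feature = "rayon")]
    {
        use rayon::prelude::*;
        data.par_iter_mut().for_each(|x| *x *= 2.0);
    }
    #[cfg(not(feature = "rayon"))]
    for x in data.iter_mut() {
        *x *= 2.0;
    }
}
```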

@Shnatsel
Copy link
Collaborator Author

The initial numbers are promising: over 3x faster DIT at size 16777216. The gains for smaller sizes are more modest, and the smallest sizes that use this codepath regress.
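
One plausible shape for a planner fallback that would avoid that small-size regression (a sketch only: the `PARALLEL_CUTOFF` constant, its value, and the square-root workload are assumptions, not PhastFT's planner logic):

```rust
use rayon::prelude::*;

/// Hypothetical size cutoff below which the serial path wins.
/// 2^17 matches the 131072 break-even point reported later in this
/// thread, but the real value would have to come from benchmarks.
const PARALLEL_CUTOFF: usize = 1 << 17;

/// Made-up workload standing in for one transform pass.
fn process(buf: &mut [f64]) {
    if buf.len() >= PARALLEL_CUTOFF {
        // Large inputs: thread-pool overhead is amortized.
        buf.par_iter_mut().for_each(|x| *x = x.sqrt());
    } else {
        // Small inputs: per-task overhead dominates; stay serial.
        for x in buf.iter_mut() {
            *x = x.sqrt();
        }
    }
}
```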

@codecov-commenter commented Nov 30, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.82%. Comparing base (e3d75e5) to head (2d4130e).

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #50      +/-   ##
==========================================
+ Coverage   99.29%   99.82%   +0.52%     
==========================================
  Files          12       13       +1     
  Lines        2277     2281       +4     
==========================================
+ Hits         2261     2277      +16     
+ Misses         16        4      -12     


@Shnatsel (Collaborator, Author)

With the rayon thread pool also being used for COBRA instead of spawning brand-new threads, sizes 131072 and above are 2x faster compared to main, and our largest size, 16777216, is 2.5x faster. Admittedly, my CPU has a lot of threads, so the gains may be less pronounced on lower core counts.
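
Roughly, the difference between the two approaches looks like this (a sketch with a made-up doubling workload, not the actual COBRA kernel): `std::thread::scope` spawns and joins fresh OS threads on every call, while `rayon::join` reuses a persistent worker pool:

```rust
use std::thread;

// Spawning fresh OS threads on every call: each invocation pays
// thread creation and teardown costs.
fn halves_with_std(left: &mut [f64], right: &mut [f64]) {
    thread::scope(|s| {
        s.spawn(|| left.iter_mut().for_each(|x| *x *= 2.0));
        s.spawn(|| right.iter_mut().for_each(|x| *x *= 2.0));
    });
}

// Reusing rayon's persistent worker pool: no per-call thread spawns.
fn halves_with_rayon(left: &mut [f64], right: &mut [f64]) {
    rayon::join(
        || left.iter_mut().for_each(|x| *x *= 2.0),
        || right.iter_mut().for_each(|x| *x *= 2.0),
    );
}
```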

@smu160 I'd appreciate benchmarks on ARM.

@Shnatsel (Collaborator, Author) commented Dec 1, 2025

So, fun fact: disabling the rayon feature, and thereby statically knowing that multi-threading will not be used for COBRA, improves benchmarks for the smallest size considerably:

Forward f32/PhastFT DIT/64
                        time:   [156.70 ns 161.32 ns 164.25 ns]
                        thrpt:  [389.66 Melem/s 396.73 Melem/s 408.42 Melem/s]
                        thrpt:  [1.4516 GiB/s 1.4779 GiB/s 1.5215 GiB/s]
                 change:
                        time:   [−29.253% −26.600% −23.937%] (p = 0.00 < 0.05)
                        thrpt:  [+31.469% +36.240% +41.348%]
                        Performance has improved.

Not sure if it's a code-layout artifact, the result of more inlining, or something else. We never had the chance to see these gains before because the std::thread::scope codepath was compiled unconditionally.
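
A sketch of what the static knowledge buys (hypothetical shape and workload, not PhastFT's actual code): once the cfg is resolved at compile time, the serial branch is the only code the optimizer sees, so it can inline freely instead of carrying a runtime size check plus a cold parallel path:

```rust
/// Placeholder for the serial codepath.
fn work_serial(data: &mut [f64]) {
    data.iter_mut().for_each(|x| *x *= 2.0);
}

#[cfg(feature = "rayon")]
fn work(data: &mut [f64]) {
    // Runtime choice: the optimizer has to keep both codepaths alive.
    let mid = data.len() / 2;
    if data.len() >= 1 << 17 {
        let (a, b) = data.split_at_mut(mid);
        rayon::join(|| work_serial(a), || work_serial(b));
    } else {
        work_serial(data);
    }
}

#[cfg(not(feature = "rayon"))]
#[inline]
fn work(data: &mut [f64]) {
    // With rayon compiled out this is the entire function: no branch,
    // no cold parallel code nearby, and it inlines freely into callers.
    work_serial(data);
}
```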

I'm not going to do anything about it just yet, but it is something to keep in mind for the future.

Commit: "…hmarks show that it regresses small sizes far less, but not enough to break even with the single-threaded implementation; but also regresses large sizes a lot. So it loses out to both rayon and single-threaded depending on the size and doesn't seem to be worth it."

Revert commit (message quotes the above): "This reverts commit d703fe3."
@Shnatsel changed the title from "PoC: multithreading via Rayon" to "multithreading via Rayon" on Dec 2, 2025
@Shnatsel marked this pull request as ready for review on December 2, 2025 12:42
@smu160 (Member) commented Dec 3, 2025

Results on Apple M2:

[Four benchmark screenshots, taken Dec 2, 2025]

@Shnatsel merged commit 986d403 into main on Dec 3, 2025 (10 checks passed)
@Shnatsel deleted the rayon branch on December 3, 2025 09:40
