Skip to content

Conversation

@krakow10
Copy link
Contributor

@krakow10 krakow10 commented May 3, 2025

  • Fixes Unreliable Benchmarks #529
  • Substantial performance regression on Windows
  • Substantial performance improvement on Linux, possibly bugged

If there's no reason why InstanceBuilder.properties can't be directly assigned to Instance.properties, then let's do that. The same logic is applied to building the properties in rbx_xml. The "Miner's Haven/Deserialize" benchmark says throughput is increased by 30%. I was mistaken, the performance uplift was due to #529 comparing to a bad baseline run on master branch. There seems to be a performance disparity between Windows and Linux for this change.

@Dekkonot
Copy link
Member

Dekkonot commented May 3, 2025

For what it's worth, this was a performance decision because a lot of time is spent inserting into that hashmap: #467

On my machine the change in this PR actually degrades performance by quite a lot:
image

Are you able to replicate the boost in performance? What are your system specs like?

@krakow10
Copy link
Contributor Author

krakow10 commented May 4, 2025

Excellent, I assumed it must be that way for performance reasons, but I still don't understand how it's faster to build a vec and then build a hashmap instead of just directly building a hashmap. Is is because of rehashing while allocating more capacity? I'm running the benchmarks on Linux using a 5950X like I mentioned on #529.

I reran the benchmarks first on master and then on properties branch, here is the full output for the properties branch:

     Running benches/suite/main.rs (target/release/deps/suite-4220ebfc15de915e)
Gnuplot not found, using plotters backend
100 Folders/Serialize   time:   [20.673 µs 20.685 µs 20.699 µs]
                        thrpt:  [36.445 MiB/s 36.469 MiB/s 36.490 MiB/s]
                 change:
                        time:   [+0.2859% +0.3785% +0.4745%] (p = 0.00 < 0.05)
                        thrpt:  [-0.4723% -0.3771% -0.2850%]
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
100 Folders/Deserialize time:   [34.843 µs 34.905 µs 34.985 µs]
                        thrpt:  [20.771 MiB/s 20.819 MiB/s 20.857 MiB/s]
                 change:
                        time:   [-6.5392% -6.4015% -6.2569%] (p = 0.00 < 0.05)
                        thrpt:  [+6.6745% +6.8393% +6.9967%]
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

100 Deep Folders/Serialize
                        time:   [22.916 µs 22.935 µs 22.952 µs]
                        thrpt:  [32.866 MiB/s 32.891 MiB/s 32.918 MiB/s]
                 change:
                        time:   [-2.3817% -2.0812% -1.8326%] (p = 0.00 < 0.05)
                        thrpt:  [+1.8668% +2.1254% +2.4398%]
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
100 Deep Folders/Deserialize
                        time:   [40.367 µs 40.395 µs 40.423 µs]
                        thrpt:  [18.001 MiB/s 18.013 MiB/s 18.026 MiB/s]
                 change:
                        time:   [-5.9860% -5.8948% -5.8109%] (p = 0.00 < 0.05)
                        thrpt:  [+6.1694% +6.2640% +6.3671%]
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

100 100-line ModuleScripts/Serialize
                        time:   [87.191 µs 87.308 µs 87.453 µs]
                        thrpt:  [42.181 MiB/s 42.251 MiB/s 42.307 MiB/s]
                 change:
                        time:   [-0.9348% -0.0077% +1.3494%] (p = 0.98 > 0.05)
                        thrpt:  [-1.3314% +0.0077% +0.9436%]
                        No change in performance detected.
Found 18 outliers among 100 measurements (18.00%)
  2 (2.00%) high mild
  16 (16.00%) high severe
100 100-line ModuleScripts/Deserialize
                        time:   [148.15 µs 148.28 µs 148.43 µs]
                        thrpt:  [24.576 MiB/s 24.601 MiB/s 24.622 MiB/s]
                 change:
                        time:   [-4.6385% -4.1208% -3.6891%] (p = 0.00 < 0.05)
                        thrpt:  [+3.8305% +4.2979% +4.8642%]
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  7 (7.00%) high mild
  4 (4.00%) high severe

1,000 Parts/Serialize   time:   [1.7785 ms 1.7821 ms 1.7863 ms]
                        thrpt:  [2.3432 MiB/s 2.3487 MiB/s 2.3535 MiB/s]
                 change:
                        time:   [+0.8976% +1.1602% +1.4339%] (p = 0.00 < 0.05)
                        thrpt:  [-1.4136% -1.1469% -0.8896%]
                        Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) high mild
  6 (6.00%) high severe
1,000 Parts/Deserialize time:   [1.9735 ms 1.9768 ms 1.9807 ms]
                        thrpt:  [2.0516 MiB/s 2.0556 MiB/s 2.0591 MiB/s]
                 change:
                        time:   [+6.9658% +7.2007% +7.4589%] (p = 0.00 < 0.05)
                        thrpt:  [-6.9412% -6.7170% -6.5122%]
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

Benchmarking Miner's Haven/Serialize: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 14.3s, or reduce sample count to 30.
Miner's Haven/Serialize time:   [142.46 ms 142.93 ms 143.44 ms]
                        thrpt:  [52.796 MiB/s 52.986 MiB/s 53.159 MiB/s]
                 change:
                        time:   [-9.0355% -8.5930% -8.1495%] (p = 0.00 < 0.05)
                        thrpt:  [+8.8726% +9.4008% +9.9330%]
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) high mild
  5 (5.00%) high severe
Benchmarking Miner's Haven/Deserialize: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 18.0s, or reduce sample count to 20.
Miner's Haven/Deserialize
                        time:   [178.69 ms 179.64 ms 180.70 ms]
                        thrpt:  [41.957 MiB/s 42.205 MiB/s 42.428 MiB/s]
                 change:
                        time:   [-26.671% -25.882% -25.149%] (p = 0.00 < 0.05)
                        thrpt:  [+33.600% +34.920% +36.371%]
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  6 (6.00%) high mild
  4 (4.00%) high severe

Seems positive and reproducible.

@kennethloeffler
Copy link
Member

kennethloeffler commented May 5, 2025

This change overall degrades performance on my M1 MacBook Air. Here is the deserialize Miner's Haven benchmark plotted (n = 1000, red is latest master, blue is this branch):
image

I'm not entirely sure why there is such a significant difference, but I did some cursory profiling and noticed that these changes cause ~10% more L1 misses and ~12% more L2 misses when deserializing miners-haven.rbxl on my machine. So, one reason could be that using a scratch Vec for instance builder properties somehow improves cache locality over UstrMap (at least, on my machine and on this workload).

I highly recommend resolving #529 before moving on with any performance work! The benchmarks are only as good as the data they can collect - if the data is polluted with e.g. CPU boosts then it's much, much harder to get a good signal

@krakow10
Copy link
Contributor Author

krakow10 commented May 6, 2025

I looked up how to set the cpu performance governor, and it can be done using cpupower like so:

sudo cpupower frequency-set -g performance

Which supposedly just sets it to always run at the maximum frequency. The default governer was called "powersave".

Running benchmarks repeatedly on the properties branch has been consistent within 5% across 10 runs, but running 5 benchmarks on the master or entries branch both elicit the 70% variance issue, going by feeling I'd say it's fast 50% of the time and slow 50% of the time. The properties branch's consistent benchmark is 10% slower than master for the "fast" variant of the "1,000 Parts/Deserialize" benchmark.

Ustr uses precomputed hashes which I assume are the same for every hashmap. Could it be accidental quadratic behaviour when doing new_map = map.iter().collect();?

@Dekkonot
Copy link
Member

Tried this again to see if any of the other changes had magically fixed the performance regression but nope, this still regresses performance by a fair bit. I ran it around a dozen times and it was worse every time, so I'm still not sure what you're observing.

@krakow10
Copy link
Contributor Author

krakow10 commented Nov 19, 2025

I've updated the description to include pertinent information. The originally reported performance uplift was because I was comparing to a low-performance run of master branch. Compared to the high-performance variant, which is the expected master branch performance, this change introduces a substantial performance regression.

@krakow10
Copy link
Contributor Author

Closing after revisiting benchmarks with a new governor tool, see #529 (comment).

@krakow10 krakow10 closed this Nov 20, 2025
@krakow10 krakow10 deleted the properties branch November 20, 2025 01:57
@krakow10 krakow10 restored the properties branch November 20, 2025 04:21
@krakow10
Copy link
Contributor Author

krakow10 commented Nov 20, 2025

Okay, I can no longer repro #529, but I still see a performance improvement on this PR, upwards of 30% on the micro benchmarks! I'm more interested to see why there is a disparity between Windows and Linux / what the bug is than I am to have the actual changes merged. I even hit a deserialize unwrap during benchmarking after rebasing:

thread 'main' (485409) panicked at /home/quat/git/rbx-dom/rbx_binary/src/deserializer/state.rs:527:80:
called `Option::unwrap()` on a `None` value

which corresponds to this line:

let instance = self.instances_by_ref.get_mut(referent).unwrap();

I hit a segfault on the next run:

Benchmarking Miner's Haven/Deserialize: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 17.8s, or reduce sample count to 20.
Benchmarking Miner's Haven/Deserialize: Collecting 100 samples in estimated 17.803 s (100 iterations)error: bench failed, to rerun pass `-p rbx_binary --bench suite`

Caused by:
  process didn't exit successfully: `/home/quat/git/rbx-dom/target/release/deps/suite-4db239afe3b2da11 --bench` (signal: 11, SIGSEGV: invalid memory reference)

Very curious! Undefined behaviour!

First successful run:
image

(Nov 20)
I ran the benchmarks on the master branch today and my computer hard crashed and rebooted! What the hell!

(Nov 29)
I am once again getting #529, even with cpu-power-gui.

@krakow10 krakow10 reopened this Nov 20, 2025
@krakow10 krakow10 marked this pull request as draft November 21, 2025 00:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Unreliable Benchmarks

3 participants