
diffuse or sharpen CPU performance tuning #20468

Draft
kofa73 wants to merge 19 commits into darktable-org:master from kofa73:diffuse-perf

Conversation

@kofa73 (Contributor) commented Mar 8, 2026

AI disclosure: The code changes in this PR were done by coding agents. I have monitored the progress and set rules (e.g. no gaming the benchmark).

A regression may have sneaked in; I'm on it (double-checking the setup and, if needed, bisecting the commits).

So: yes, there's a difference. However, whether it's a regression is debatable. The stats:

```
Max dE                   : 47.05293
Avg dE                   : 0.04446
Std dE                   : 0.17545
----------------------------------
Pixels below avg + 0 std : 87.89 %
Pixels below avg + 1 std : 90.88 %
Pixels below avg + 3 std : 98.62 %
Pixels below avg + 6 std : 99.84 %
Pixels below avg + 9 std : 99.93 %
----------------------------------
Pixels above tolerance   : 0.03 %
```

The average dE is low (0.04446), and 87.89% of pixels have a dE below the average; 0.03% of pixels (about 7200 in a 24 MPx image) are above the tolerance.

The reason for the difference is the changed order of floating-point operations, which leads to tiny numerical differences that then get amplified by sharpening (remember, this sidecar contains all 21 diffuse or sharpen presets, most of which apply sharpening of some sort).

The files are here, so you can check this (both look extremely noisy because of the many sharpening DoS instances): https://tech.kovacs-telekes.org/files/dt-pr-20468

The approach can be found in diffuse-opt-instructions.md. If you find it acceptable, I'll continue with clean-up and open a PR for the additional integration tests (the existing integration test did not catch one of the mistakes the agents made). Normal images (as demonstrated by the integration tests) do not show this difference. I also did not see it with the image I used for performance measurements and error detection, whose history stack includes 'demosaic sharpen AA', 'lens deblurring strong', and one of the community-supplied sharpening presets.

The diff as 16-bit PNG (difference calculated in linear mode, PNG has gamma):

I'm happy to provide any details needed to assess if we want to move forward. Also, if we decide this is a good use of coding agents, I can repeat this exercise with the OpenCL path, as well as for other modules.

Note that despite the numerous commits, the first one brought most of the improvements, so it may make sense to only use the first few commits (see diffuse-performance.log).

Below is a measurement (only done once; the actual tuning process used a measurement loop for accuracy) to give you an idea about the results. The code was run (Release build using build.sh) on a 24 MPx image, with a history stack that contained all presets of diffuse or sharpen shipped with darktable (I hope I got them all).

Clean-up still in progress (I'll remove the files used in testing, the state tracking data, the instructions for the agents etc).

New measurement (measuring the minimum of each preset over a number of iterations, with new code, not all of which has been pushed yet):

| preset | original | tuned | speed-up (user) | speed-up (CPU) |
| --- | --- | --- | --- | --- |
| artistic effects / bloom | 1.530 secs (12.640 CPU) | 0.928 secs (6.292 CPU) | 65% | 101% |
| artistic effects / simulate line drawing | 70.128 secs (798.174 CPU) | 60.127 secs (663.304 CPU) | 17% | 20% |
| artistic effects / simulate watercolor | 5.752 secs (59.985 CPU) | 5.116 secs (51.501 CPU) | 12% | 16% |
| dehaze / extra contrast | 18.522 secs (204.144 CPU) | 14.021 secs (150.422 CPU) | 32% | 36% |
| dehaze / default | 18.565 secs (205.068 CPU) | 14.203 secs (150.615 CPU) | 31% | 36% |
| denoise / coarse | 32.974 secs (369.817 CPU) | 24.007 secs (258.988 CPU) | 37% | 43% |
| denoise / fine | 21.889 secs (245.658 CPU) | 15.957 secs (171.113 CPU) | 37% | 44% |
| denoise / medium | 27.473 secs (308.382 CPU) | 19.905 secs (215.217 CPU) | 38% | 43% |
| inpaint highlights | 12.490 secs (130.758 CPU) | 12.782 secs (128.914 CPU) | -2% | 1% |
| lens deblur / hard | 24.485 secs (276.125 CPU) | 17.875 secs (193.238 CPU) | 37% | 43% |
| lens deblur / medium | 16.564 secs (183.287 CPU) | 12.072 secs (128.458 CPU) | 37% | 43% |
| lens deblur / soft | 7.049 secs (76.117 CPU) | 5.103 secs (52.862 CPU) | 38% | 44% |
| local contrast / fast | 2.081 secs (18.038 CPU) | 1.678 secs (12.718 CPU) | 24% | 42% |
| local contrast / fine | 9.449 secs (99.398 CPU) | 7.926 secs (81.531 CPU) | 19% | 22% |
| local contrast / normal | 18.503 secs (201.509 CPU) | 15.435 secs (165.649 CPU) | 20% | 22% |
| sharpen demosaicing / AA filter | 1.108 secs (9.233 CPU) | 0.998 secs (7.468 CPU) | 11% | 24% |
| sharpen demosaicing / no AA filter | 0.875 secs (6.977 CPU) | 0.766 secs (5.704 CPU) | 14% | 22% |
| sharpness / fast | 1.848 secs (15.947 CPU) | 1.485 secs (11.110 CPU) | 24% | 44% |
| sharpness / normal | 2.360 secs (23.400 CPU) | 2.076 secs (19.028 CPU) | 14% | 23% |
| sharpness / strong | 4.377 secs (46.120 CPU) | 3.779 secs (37.746 CPU) | 16% | 22% |
| surface blur | 2.685 secs (26.367 CPU) | 2.310 secs (20.903 CPU) | 16% | 26% |

@kofa73 kofa73 marked this pull request as draft March 8, 2026 09:31
kofa73 added 19 commits March 11, 2026 12:40
…ecution sum

Remove neighbour_pixel_HF[9] and neighbour_pixel_LF[9] stack arrays,
computing gradients, sums, and variance directly from HF/LF source
pointers using pre-computed offsets. Eliminates 72 float writes and
reads from stack, leveraging L1 cache spatial locality instead.
Compute cos²θ, sin²θ, and cosθ·sinθ directly from squared gradient
magnitude (gx²/m², gy²/m², gx·gy/m²) instead of normalizing the
gradient first then squaring. Reuses gx² and gy² already computed
for the magnitude, eliminating redundant squarings and one division
per gradient direction per channel.
…xecution sum

Replace 4 separate compute_convolution calls with 2 paired loops:
one for derivatives 0+2 (sharing gradient angle data) and one for
1+3 (sharing laplacian angle data). Each paired loop computes both
LF and HF convolutions in a single pass, halving loop overhead and
improving register reuse for the shared cos²θ/sin²θ/cosθsinθ values.
Each derivative uses its own isotropy_type for correctness.
New tests (0087-0090) cover all three isotropy modes and their
combinations. The critical test (0089-mixed) detects index-sharing
bugs in paired convolution loops — proven by reproducing the exact
bug found during the paired-convolution optimisation (Max dE 12.89
on buggy build vs 0.96 on correct build, while the original 0086
test passes both).

- create_diffuse_tests.py: generates XMP test configs
- diffuse-new-integration-tests.md: gap analysis and empirical proof
- diffuse-perf-test-files/deltae: local deltae using /tmp/testenv venv
Replace the k=0..3 outer loop with a single for_each_channel expression
computing all 4 derivative*ABCD products in one pass. Eliminates loop
overhead and enables better GCC SIMD codegen for the multiply-add chain.
…xecution sum

Fold the variance FMA (threshold + variance * factor) directly into the
output loop, eliminating one separate for_each_channel pass. The output
loop body remains small enough for effective vectorization.
…f_anisotropy for 35.599s execution sum

Remove the 0.5f central-difference scaling factor from gradient and
laplacian computation. The factor cancels in angle ratios (cos²θ, sin²θ,
cosθsinθ) since they are computed as gx²/m², gy²/m², gx·gy/m². For the
magnitude term, precompute half_anisotropy[k] = anisotropy[k] * 0.5f
outside the pixel loop. This eliminates 4 float multiplies per pixel per
channel from the hot path.
Explicit a=n0+n8,b=n2+n6 intermediates for diag_sum/diag_diff
regressed 1.69% (36.201s vs 35.599s). GCC already performs CSE;
explicit intermediates hurt register allocation.
…ielding 35.224s execution sum

Move the sums+variance for_each_channel loop (which loads all 18 LF+HF
neighbor pixels) before the gradient/laplacian angle computation loop.
This ensures all pixel values are in L1 cache when the angle loop
re-loads n1,n3,n5,n7, eliminating cache miss penalties. Also update
instructions to allow modifying diffuse-only functions in other files.