
diffuse or sharpen CPU performance tuning #20468

Draft
kofa73 wants to merge 19 commits into darktable-org:master from kofa73:diffuse-perf

Conversation

@kofa73 (Contributor) commented Mar 8, 2026

AI disclosure: The code changes in this PR were done by coding agents. I have monitored the progress and set rules (e.g. no gaming the benchmark).

A regression may have sneaked in; I'm on it (double-checking the setup and, if needed, bisecting the commits).

So: yes, there's a difference. However, whether it's a regression is debatable. The stats:

```
Max dE                   : 47.05293
Avg dE                   : 0.04446
Std dE                   : 0.17545
----------------------------------
Pixels below avg + 0 std : 87.89 %
Pixels below avg + 1 std : 90.88 %
Pixels below avg + 3 std : 98.62 %
Pixels below avg + 6 std : 99.84 %
Pixels below avg + 9 std : 99.93 %
----------------------------------
Pixels above tolerance   : 0.03 %
```

The average dE is low (0.04446), and 87.89% of pixels have a dE below the average; 0.03% of pixels (about 7200 in a 24 MPx image) are above the tolerance.

The reason for the difference is the changed order of floating-point operations, which leads to tiny numerical differences that then get amplified by sharpening (remember, this sidecar contains all 21 diffuse or sharpen presets, most of which apply sharpening of some sort).

The files are here, so you can check this (both look extremely noisy because of the many sharpening DoS instances): https://tech.kovacs-telekes.org/files/dt-pr-20468

The approach can be found in diffuse-opt-instructions.md. If you find it acceptable, I'll continue with clean-up and open a PR for the additional integration tests (the existing integration test did not catch one of the mistakes the agents made). Normal images (as demonstrated by the integration tests) do not show this difference. I also did not see it with the image I used for performance measurements and error detection, whose history stack includes 'demosaic sharpen AA', 'lens deblurring strong', and one of the community-supplied sharpening presets.

The diff as 16-bit PNG (difference calculated in linear mode, PNG has gamma):

I'm happy to provide any details needed to assess if we want to move forward. Also, if we decide this is a good use of coding agents, I can repeat this exercise with the OpenCL path, as well as for other modules.

Note that despite the numerous commits, the first one brought most of the improvements, so it may make sense to only use the first few commits (see diffuse-performance.log).

Below is a measurement (only done once; the actual tuning process used a measurement loop for accuracy) to give you an idea about the results. The code was run (Release build using build.sh) on a 24 MPx image, with a history stack that contained all presets of diffuse or sharpen shipped with darktable (I hope I got them all).

Clean-up still in progress (I'll remove the files used in testing, the state tracking data, the instructions for the agents etc).

New measurement (measuring the minimum of each preset over a number of iterations, with new code, not all of which has been pushed yet):

| preset | original | tuned | speed-up (user) | speed-up (CPU) |
| --- | --- | --- | --- | --- |
| artistic effects / bloom | 1.530 secs (12.640 CPU) | 0.928 secs (6.292 CPU) | 65% | 101% |
| artistic effects / simulate line drawing | 70.128 secs (798.174 CPU) | 60.127 secs (663.304 CPU) | 17% | 20% |
| artistic effects / simulate watercolor | 5.752 secs (59.985 CPU) | 5.116 secs (51.501 CPU) | 12% | 16% |
| dehaze / extra contrast | 18.522 secs (204.144 CPU) | 14.021 secs (150.422 CPU) | 32% | 36% |
| dehaze / default | 18.565 secs (205.068 CPU) | 14.203 secs (150.615 CPU) | 31% | 36% |
| denoise / coarse | 32.974 secs (369.817 CPU) | 24.007 secs (258.988 CPU) | 37% | 43% |
| denoise / fine | 21.889 secs (245.658 CPU) | 15.957 secs (171.113 CPU) | 37% | 44% |
| denoise / medium | 27.473 secs (308.382 CPU) | 19.905 secs (215.217 CPU) | 38% | 43% |
| inpaint highlights | 12.490 secs (130.758 CPU) | 12.782 secs (128.914 CPU) | -2% | 1% |
| lens deblur / hard | 24.485 secs (276.125 CPU) | 17.875 secs (193.238 CPU) | 37% | 43% |
| lens deblur / medium | 16.564 secs (183.287 CPU) | 12.072 secs (128.458 CPU) | 37% | 43% |
| lens deblur / soft | 7.049 secs (76.117 CPU) | 5.103 secs (52.862 CPU) | 38% | 44% |
| local contrast / fast | 2.081 secs (18.038 CPU) | 1.678 secs (12.718 CPU) | 24% | 42% |
| local contrast / fine | 9.449 secs (99.398 CPU) | 7.926 secs (81.531 CPU) | 19% | 22% |
| local contrast / normal | 18.503 secs (201.509 CPU) | 15.435 secs (165.649 CPU) | 20% | 22% |
| sharpen demosaicing / AA filter | 1.108 secs (9.233 CPU) | 0.998 secs (7.468 CPU) | 11% | 24% |
| sharpen demosaicing / no AA filter | 0.875 secs (6.977 CPU) | 0.766 secs (5.704 CPU) | 14% | 22% |
| sharpness / fast | 1.848 secs (15.947 CPU) | 1.485 secs (11.110 CPU) | 24% | 44% |
| sharpness / normal | 2.360 secs (23.400 CPU) | 2.076 secs (19.028 CPU) | 14% | 23% |
| sharpness / strong | 4.377 secs (46.120 CPU) | 3.779 secs (37.746 CPU) | 16% | 22% |
| surface blur | 2.685 secs (26.367 CPU) | 2.310 secs (20.903 CPU) | 16% | 26% |

@kofa73 kofa73 marked this pull request as draft March 8, 2026 09:31
kofa73 added 19 commits March 11, 2026 12:40
…ecution sum

Remove neighbour_pixel_HF[9] and neighbour_pixel_LF[9] stack arrays,
computing gradients, sums, and variance directly from HF/LF source
pointers using pre-computed offsets. Eliminates 72 float writes and
reads from stack, leveraging L1 cache spatial locality instead.
Compute cos²θ, sin²θ, and cosθ·sinθ directly from squared gradient
magnitude (gx²/m², gy²/m², gx·gy/m²) instead of normalizing the
gradient first then squaring. Reuses gx² and gy² already computed
for the magnitude, eliminating redundant squarings and one division
per gradient direction per channel.
…xecution sum

Replace 4 separate compute_convolution calls with 2 paired loops:
one for derivatives 0+2 (sharing gradient angle data) and one for
1+3 (sharing laplacian angle data). Each paired loop computes both
LF and HF convolutions in a single pass, halving loop overhead and
improving register reuse for the shared cos²θ/sin²θ/cosθsinθ values.
Each derivative uses its own isotropy_type for correctness.
New tests (0087-0090) cover all three isotropy modes and their
combinations. The critical test (0089-mixed) detects index-sharing
bugs in paired convolution loops — proven by reproducing the exact
bug found during the paired-convolution optimisation (Max dE 12.89
on buggy build vs 0.96 on correct build, while the original 0086
test passes both).

- create_diffuse_tests.py: generates XMP test configs
- diffuse-new-integration-tests.md: gap analysis and empirical proof
- diffuse-perf-test-files/deltae: local deltae using /tmp/testenv venv
Replace the k=0..3 outer loop with a single for_each_channel expression
computing all 4 derivative*ABCD products in one pass. Eliminates loop
overhead and enables better GCC SIMD codegen for the multiply-add chain.
…xecution sum

Fold the variance FMA (threshold + variance * factor) directly into the
output loop, eliminating one separate for_each_channel pass. The output
loop body remains small enough for effective vectorization.
…f_anisotropy for 35.599s execution sum

Remove the 0.5f central-difference scaling factor from gradient and
laplacian computation. The factor cancels in angle ratios (cos²θ, sin²θ,
cosθsinθ) since they are computed as gx²/m², gy²/m², gx·gy/m². For the
magnitude term, precompute half_anisotropy[k] = anisotropy[k] * 0.5f
outside the pixel loop. This eliminates 4 float multiplies per pixel per
channel from the hot path.
Explicit a=n0+n8,b=n2+n6 intermediates for diag_sum/diag_diff
regressed 1.69% (36.201s vs 35.599s). GCC already performs CSE;
explicit intermediates hurt register allocation.
…ielding 35.224s execution sum

Move the sums+variance for_each_channel loop (which loads all 18 LF+HF
neighbor pixels) before the gradient/laplacian angle computation loop.
This ensures all pixel values are in L1 cache when the angle loop
re-loads n1,n3,n5,n7, eliminating cache miss penalties. Also update
instructions to allow modifying diffuse-only functions in other files.