diffuse or sharpen CPU performance tuning #20468
Draft
kofa73 wants to merge 19 commits into darktable-org:master from
Conversation
…ecution sum Remove the neighbour_pixel_HF[9] and neighbour_pixel_LF[9] stack arrays, computing gradients, sums, and variance directly from the HF/LF source pointers using pre-computed offsets. This eliminates 72 float writes and reads to and from the stack, relying on L1 cache spatial locality instead.
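The commit message above can be sketched as follows. This is an illustrative reconstruction, not the actual darktable code: the function name, the width-5 test buffer, and the offset table are assumptions; the point is indexing the source buffer through pre-computed offsets instead of staging the neighbourhood in a stack array.

```c
#include <math.h>
#include <stddef.h>

// Illustrative sketch: sum a 3x3 neighbourhood by reading the source buffer
// directly through pre-computed offsets, rather than first copying the nine
// neighbours into a neighbour_pixel[9] stack array and reading them back.
// The loads hit L1 via spatial locality; no intermediate stores are needed.
static float neighbourhood_sum(const float *const buf, const size_t center,
                               const ptrdiff_t offsets[9])
{
  float sum = 0.0f;
  for(int i = 0; i < 9; i++)
    sum += buf[center + offsets[i]]; // direct load, no stack round-trip
  return sum;
}
```

For a row-major buffer of width `w`, the offsets would be pre-computed once per image as `{-w-1, -w, -w+1, -1, 0, 1, w-1, w, w+1}` and reused for every pixel.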
Compute cos²θ, sin²θ, and cosθ·sinθ directly from squared gradient magnitude (gx²/m², gy²/m², gx·gy/m²) instead of normalizing the gradient first then squaring. Reuses gx² and gy² already computed for the magnitude, eliminating redundant squarings and one division per gradient direction per channel.
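A minimal sketch of this optimisation, with illustrative names (not the exact darktable identifiers): the squared direction cosines are formed directly from gx², gy², and gx·gy with a single division, instead of normalising the gradient (a division per component) and then squaring.

```c
#include <math.h>

// Compute cos²θ, sin²θ and cosθ·sinθ of the gradient direction directly
// from the squared components: cos²θ = gx²/|g|², sin²θ = gy²/|g|²,
// cosθ·sinθ = gx·gy/|g|². One division total; gx² and gy² are reused
// from the magnitude computation. The epsilon guards against |g| = 0.
static void angle_ratios(const float gx, const float gy,
                         float *cos2, float *sin2, float *cossin)
{
  const float gx2 = gx * gx;
  const float gy2 = gy * gy;
  const float inv = 1.0f / (gx2 + gy2 + 1e-15f); // single division
  *cos2 = gx2 * inv;
  *sin2 = gy2 * inv;
  *cossin = gx * gy * inv;
}
```

By construction cos²θ + sin²θ = 1 (up to the epsilon), which makes the identity a convenient sanity check.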
…xecution sum Replace 4 separate compute_convolution calls with 2 paired loops: one for derivatives 0+2 (sharing gradient angle data) and one for 1+3 (sharing laplacian angle data). Each paired loop computes both LF and HF convolutions in a single pass, halving loop overhead and improving register reuse for the shared cos²θ/sin²θ/cosθsinθ values. Each derivative uses its own isotropy_type for correctness.
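The pairing idea can be sketched like this (hypothetical signature, not darktable's compute_convolution): one loop applies the shared angle weight to both the LF and HF planes, so the weight is loaded once and stays in a register across both uses, and the loop overhead of two separate passes is halved.

```c
#include <math.h>
#include <stddef.h>

// Illustrative paired pass: the two planes that share the same per-pixel
// angle weight are processed in a single loop instead of two.
static void paired_convolution(const float *const lf, const float *const hf,
                               const float *const weight, const size_t n,
                               float *const out_lf, float *const out_hf)
{
  for(size_t i = 0; i < n; i++)
  {
    const float w = weight[i]; // loaded once, reused for both planes
    out_lf[i] = w * lf[i];
    out_hf[i] = w * hf[i];
  }
}
```

In the real module each derivative order still carries its own isotropy_type, as the commit notes; only the shared cos²θ/sin²θ/cosθ·sinθ terms are reused across the pair.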
New tests (0087-0090) cover all three isotropy modes and their combinations. The critical test (0089-mixed) detects index-sharing bugs in paired convolution loops — proven by reproducing the exact bug found during the paired-convolution optimisation (max dE 12.89 on the buggy build vs 0.96 on the correct build, while the original 0086 test passes on both).
- create_diffuse_tests.py: generates XMP test configs
- diffuse-new-integration-tests.md: gap analysis and empirical proof
- diffuse-perf-test-files/deltae: local deltae using the /tmp/testenv venv
Replace the k=0..3 outer loop with a single for_each_channel expression computing all 4 derivative*ABCD products in one pass. Eliminates loop overhead and enables better GCC SIMD codegen for the multiply-add chain.
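A sketch of the fusion, with illustrative names (for_each_channel in darktable iterates the 4 floats of a pixel; this sketch spells the loop out): rather than an outer k = 0..3 loop that each iteration accumulates one derivative*ABCD term, the full multiply-add chain is written per channel in one pass, which the compiler can turn into a single SIMD multiply-add sequence.

```c
#include <math.h>

// Illustrative fused accumulation: one pass over the 4 channels computes
// all four derivative*ABCD products, instead of a k = 0..3 outer loop
// accumulating into out[c] four separate times.
static void fused_accumulate(const float deriv[4][4], const float abcd[4],
                             float out[4])
{
  for(int c = 0; c < 4; c++) // single channel pass, SIMD-friendly
    out[c] = deriv[0][c] * abcd[0] + deriv[1][c] * abcd[1]
           + deriv[2][c] * abcd[2] + deriv[3][c] * abcd[3];
}
```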
…xecution sum Fold the variance FMA (threshold + variance * factor) directly into the output loop, eliminating one separate for_each_channel pass. The output loop body remains small enough for effective vectorization.
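A minimal sketch of the folding, with hypothetical names and a simplified output formula (the real module's output expression is more involved): the `threshold + variance * factor` FMA is computed inside the output loop instead of being materialised by a separate per-channel pass first.

```c
#include <math.h>

// Illustrative fold: the variance FMA lives in the output loop body,
// removing one whole for_each_channel pass. The body stays small enough
// for the compiler to vectorise.
static void output_with_variance(const float hf[4], const float variance[4],
                                 const float threshold, const float factor,
                                 float out[4])
{
  for(int c = 0; c < 4; c++)
  {
    const float t = threshold + variance[c] * factor; // fused in-loop
    out[c] = hf[c] * t;                               // simplified output step
  }
}
```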
…f_anisotropy for 35.599s execution sum Remove the 0.5f central-difference scaling factor from gradient and laplacian computation. The factor cancels in angle ratios (cos²θ, sin²θ, cosθsinθ) since they are computed as gx²/m², gy²/m², gx·gy/m². For the magnitude term, precompute half_anisotropy[k] = anisotropy[k] * 0.5f outside the pixel loop. This eliminates 4 float multiplies per pixel per channel from the hot path.
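The cancellation argument can be demonstrated directly (illustrative helper, not darktable code): with central differences g = k·(right − left), any common scale factor k on gx and gy cancels in the ratio gx²/(gx²+gy²), so the 0.5f can be dropped from the per-pixel gradient and folded into the precomputed half_anisotropy[k] for the magnitude term instead.

```c
#include <math.h>

// cos²θ of the gradient direction from raw finite differences dx, dy,
// scaled by an arbitrary common factor k. Because the ratio is
// (k·dx)² / ((k·dx)² + (k·dy)²), k cancels: the 0.5f central-difference
// factor has no effect on any of the angle ratios.
static float cos2_from_diffs(const float dx, const float dy, const float k)
{
  const float gx = k * dx;
  const float gy = k * dy;
  return (gx * gx) / (gx * gx + gy * gy + 1e-15f);
}
```

Only the gradient magnitude still needs the scale, and there it can be absorbed once per frame, e.g. `half_anisotropy = anisotropy * 0.5f` outside the pixel loop, as the commit describes.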
Explicit a=n0+n8, b=n2+n6 intermediates for diag_sum/diag_diff regressed performance by 1.69% (36.201s vs 35.599s). GCC already performs common-subexpression elimination; the explicit intermediates hurt register allocation.
… and isotropic precompute experiments
…ielding 35.224s execution sum Move the sums+variance for_each_channel loop (which loads all 18 LF+HF neighbor pixels) before the gradient/laplacian angle computation loop. This ensures all pixel values are in L1 cache when the angle loop re-loads n1,n3,n5,n7, eliminating cache miss penalties. Also update instructions to allow modifying diffuse-only functions in other files.
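A toy sketch of the ordering (illustrative names; the real code works on separate LF and HF planes): pass 1 touches the full neighbourhood for the sums, warming L1; pass 2 then re-reads only the axial neighbours n1, n3, n5, n7 for the gradient, now served from cache. The arithmetic is order-independent; only the memory-access order changes.

```c
#include <math.h>

// Illustrative two-pass ordering over a 3x3 neighbourhood n[0..8]
// (row-major, n[4] = center): sum first (touches everything), then the
// central-difference gradient from the axial neighbours.
static void sums_then_gradient(const float n[9], float *sum,
                               float *gx, float *gy)
{
  float s = 0.0f;
  for(int i = 0; i < 9; i++) s += n[i]; // pass 1: load all neighbours
  *sum = s;
  *gx = n[5] - n[3]; // pass 2: axial re-loads, now in L1
  *gy = n[7] - n[1];
}
```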
AI disclosure: The code changes in this PR were done by coding agents. I have monitored the progress and set rules (e.g. no gaming the benchmark).
Regression: a regression may have sneaked in; I'm on it (double-checking the setup and, if needed, bisecting the commits).
So: yes, there's a difference. However, whether it's a regression is debatable. The stats:
The average dE is low (0.04446), and 87.89% of pixels have a dE below the average. 0.03% of pixels (about 7200 in a 24 MPx image) are above the threshold.
The reason for the difference is the changed order of mathematical operations, which leads to tiny differences that then get sharpened (remember, this sidecar contains all 21 diffuse or sharpen presets, most of them performing sharpening of some sort).
The files are here, so you can check this (both look extremely noisy because of the many sharpening diffuse-or-sharpen instances): https://tech.kovacs-telekes.org/files/dt-pr-20468
The approach can be found in diffuse-opt-instructions.md. If you find it acceptable, I'll continue with clean-up and open a PR for the additional integration tests (the existing integration test did not catch one of the mistakes the agents made). In normal images (as demonstrated by the integration tests), we don't see this. I also did not get it with the image I used for performance measurements and error detection, which has 'demosaic sharpen AA', 'lens deblurring strong' and one of the community-supplied sharpening presets applied. The diff as a 16-bit PNG (difference calculated in linear mode; the PNG has gamma):

I'm happy to provide any details needed to assess if we want to move forward. Also, if we decide this is a good use of coding agents, I can repeat this exercise with the OpenCL path, as well as for other modules.
Note that despite the numerous commits, the first one brought most of the improvements, so it may make sense to only use the first few commits (see
diffuse-performance.log). Below is a measurement (only done once; the actual tuning process used a measurement loop for accuracy) to give you an idea of the results. The code was run (Release build, using build.sh) on a 24 MPx image with a history stack that contained all diffuse or sharpen presets shipped with darktable (I hope I got them all). Clean-up is still in progress (I'll remove the files used in testing, the state-tracking data, the instructions for the agents, etc.).
New measurement (taking the minimum of each preset over a number of iterations, with new code, not all of which has been pushed yet):