ENH: OoC optimizations for NeighborOrientationCorrelation and Morphological Filters by joeykleingers · Pull Request #1554 · BlueQuartzSoftware/simplnx

joeykleingers · 2026-02-28T18:56:03Z

Summary

Optimizes all 5 Group C (Multi-Iteration Morphological / Neighbor Replacement) filters for out-of-core performance using Z-slice rolling buffers and sequential per-array data transfer.

Algorithm: Z-Slice Rolling Buffer

A 3-slice rolling buffer holds the current and adjacent Z-slices in local std::vectors. All 6 face-neighbor reads come from RAM buffers instead of the OOC DataStore. A single shared codepath serves both in-core and OOC, so no DispatchAlgorithm is needed. The buffers are lightweight enough (~2-64 MB depending on dataset) that they either improve in-core performance or leave it unchanged.

Per-iteration sweep:

Pre-read Z-slices 0 and 1 into buffer slots
For each Z from 0 to dimZ-1: read next slice, process all (y, x) from buffers, rotate slots
After the full Z-sweep, apply recorded changes to all cell data arrays
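
The sweep above can be sketched roughly as follows. This is a minimal illustration and not the PR's code: `Store`, `readSlice`, and `countDifferingFaceNeighbors` are hypothetical names, a plain std::vector stands in for the OOC DataStore, and the per-voxel work (counting differing face neighbors) is only a placeholder for each filter's real logic.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for an out-of-core store: a flat, X-fastest array.
using Store = std::vector<std::int32_t>;

// Copy one whole Z-slice from the store into a RAM buffer.
void readSlice(const Store& store, std::size_t z, std::size_t sliceSize, std::vector<std::int32_t>& dest)
{
  const auto first = store.begin() + static_cast<std::ptrdiff_t>(z * sliceSize);
  dest.assign(first, first + static_cast<std::ptrdiff_t>(sliceSize));
}

// For every voxel, count the face neighbors whose value differs from the voxel's own.
// All neighbor reads hit the three slice buffers, never the store itself.
std::vector<int> countDifferingFaceNeighbors(const Store& store, std::size_t dimX, std::size_t dimY, std::size_t dimZ)
{
  assert(store.size() == dimX * dimY * dimZ);
  const std::size_t sliceSize = dimX * dimY;
  std::vector<int> result(store.size(), 0);
  if(store.empty())
  {
    return result;
  }

  // slots[0] = slice z-1, slots[1] = slice z, slots[2] = slice z+1
  std::array<std::vector<std::int32_t>, 3> slots;
  readSlice(store, 0, sliceSize, slots[1]);
  if(dimZ > 1)
  {
    readSlice(store, 1, sliceSize, slots[2]);
  }

  for(std::size_t z = 0; z < dimZ; z++)
  {
    for(std::size_t y = 0; y < dimY; y++)
    {
      for(std::size_t x = 0; x < dimX; x++)
      {
        const std::size_t inSlice = y * dimX + x;
        const std::int32_t center = slots[1][inSlice];
        int count = 0;
        if(z > 0 && slots[0][inSlice] != center) { count++; }
        if(z + 1 < dimZ && slots[2][inSlice] != center) { count++; }
        if(y > 0 && slots[1][inSlice - dimX] != center) { count++; }
        if(y + 1 < dimY && slots[1][inSlice + dimX] != center) { count++; }
        if(x > 0 && slots[1][inSlice - 1] != center) { count++; }
        if(x + 1 < dimX && slots[1][inSlice + 1] != center) { count++; }
        result[z * sliceSize + inSlice] = count;
      }
    }
    // Rotate: z becomes z-1, z+1 becomes z, then load the next slice if one exists.
    std::swap(slots[0], slots[1]);
    std::swap(slots[1], slots[2]);
    if(z + 2 < dimZ)
    {
      readSlice(store, z + 2, sliceSize, slots[2]);
    }
  }
  return result;
}
```

Each Z-slice is read from the store exactly once per sweep, and the swaps reuse the three allocations instead of copying slice contents.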

Data transfer optimization: Changed from ParallelTaskAlgorithm (multiple arrays processed concurrently, causing chunk evictions) to sequential per-array-per-voxel processing. Each array's chunks stay cached while all voxels are processed before moving to the next array.
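
The effect of loop order on chunk evictions can be illustrated with a toy model. The sketch below is an assumption-laden simulation, not the real cache: `FifoChunkCache` and both counting functions are hypothetical, voxels are visited in index order, and the example deliberately uses more arrays than cache slots.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <utility>

// Hypothetical model of a small FIFO chunk cache shared by all arrays.
// It only tracks which chunks are resident and reports loads (misses).
class FifoChunkCache
{
public:
  explicit FifoChunkCache(std::size_t slots)
  : m_Slots(slots)
  {
    assert(slots > 0);
  }

  // Touch chunk `chunkId` of array `arrayId`; returns true if it had to be loaded.
  bool access(std::size_t arrayId, std::size_t chunkId)
  {
    const std::pair<std::size_t, std::size_t> key{arrayId, chunkId};
    if(std::find(m_Resident.begin(), m_Resident.end(), key) != m_Resident.end())
    {
      return false; // hit
    }
    if(m_Resident.size() == m_Slots)
    {
      m_Resident.pop_front(); // evict the oldest resident chunk
    }
    m_Resident.push_back(key);
    return true; // miss
  }

private:
  std::size_t m_Slots;
  std::deque<std::pair<std::size_t, std::size_t>> m_Resident;
};

// Outer loop over voxels, inner loop over arrays (arrays compete for slots).
std::size_t missesPerVoxelPerArray(std::size_t numArrays, std::size_t numVoxels, std::size_t voxelsPerChunk, std::size_t slots)
{
  FifoChunkCache cache(slots);
  std::size_t misses = 0;
  for(std::size_t v = 0; v < numVoxels; v++)
  {
    for(std::size_t a = 0; a < numArrays; a++)
    {
      if(cache.access(a, v / voxelsPerChunk)) { misses++; }
    }
  }
  return misses;
}

// Outer loop over arrays, inner loop over voxels (one array owns the cache at a time).
std::size_t missesPerArrayPerVoxel(std::size_t numArrays, std::size_t numVoxels, std::size_t voxelsPerChunk, std::size_t slots)
{
  FifoChunkCache cache(slots);
  std::size_t misses = 0;
  for(std::size_t a = 0; a < numArrays; a++)
  {
    for(std::size_t v = 0; v < numVoxels; v++)
    {
      if(cache.access(a, v / voxelsPerChunk)) { misses++; }
    }
  }
  return misses;
}
```

In this model, with 8 arrays, 1000 voxels, 100 voxels per chunk, and 6 slots, the per-voxel-per-array order misses on every access (8000 loads), while the per-array-per-voxel order loads each chunk once (80 loads). Real access patterns are less regular, so the ratio is only indicative.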

Per-Filter Changes

NeighborOrientationCorrelation

Z-slice rolling buffer for quaternions (4-comp float32), phases (int32), and confidence index (float32)
Removed dead neighborDiffCount computation (computed but never read)
Pre-reads all neighbor quats/phases before pairwise comparison loop
Store bestNeighbor as const reference in transfer functor (avoids 61 MB per-task copy)
Clear bestNeighbor between cleanup levels to eliminate stale entries

ErodeDilateBadData

Z-slice rolling buffer for FeatureIds
Transfer changed from ParallelTaskAlgorithm to sequential per-array loop

ErodeDilateCoordinationNumber

Z-slice rolling buffer for FeatureIds with in-place update
Fixed DataArrayCopyTupleFunctor by-value DataArray copy to use reference
Removed redundant second computeValidFaceNeighbors call

ErodeDilateMask

Z-slice rolling buffer for both mask and maskCopy arrays
Replaced std::vector<bool> (bit-packed) with std::vector<uint8> for direct byte access

ReplaceElementAttributesWithNeighborValues

Z-slice rolling buffer for input comparison array
Transfer reordered from per-voxel-per-array to per-array-per-voxel
Replaced the throwing dynamic_cast<IDataArray&> on AttributeMatrix children with a dynamic_cast<IDataArray*> plus null-check guard

Code Quality Fixes

Removed Doxygen comments from .cpp files (documentation belongs in .hpp headers only)
Fixed single-line Doxygen to multi-line format on destructor
Fixed wrong @class tags in algorithm headers (ErodeDilateBadData, ErodeDilateCoordinationNumber, ErodeDilateMask)
Fixed misleading "highest similarity count" comment to "positive similarity count"
Renamed benchmark constants from kDimX to k_DimX convention (all 5 test files)

Performance (200x200x200 benchmarks)

| Filter | IC Before | IC After | IC Speedup | OOC Before | OOC After | OOC Speedup |
|---|---|---|---|---|---|---|
| ErodeDilateBadData | 1.37s | 0.31s | 4.42x | 25.09s | 15.57s | 1.61x |
| ErodeDilateCoordinationNumber | 0.35s | 0.22s | 1.59x | 12.43s | 3.13s | 3.97x |
| ErodeDilateMask | 0.13s | 0.11s | 1.18x | 6.43s | 3.27s | 1.97x |
| ReplaceElementAttributesWithNeighborValues | 0.35s | 0.24s | 1.46x | 6.05s | 4.90s | 1.23x |
| NeighborOrientationCorrelation | 2.19s | 1.57s | 1.39x | 67.94s | 17.24s | 3.94x |

Why This Algorithm Was Chosen

Single shared codepath: Z-slice buffers are small enough that they don't hurt in-core performance. No need for separate in-core/OOC algorithm classes or DispatchAlgorithm.
Eliminates neighbor chunk thrashing: All 6-neighbor reads come from RAM buffers, regardless of chunk boundaries.
Sequential transfer prevents cache eviction: Processing one array at a time keeps that array's chunks in the 6-slot FIFO cache.
Minimal memory overhead: Buffers are O(3 Z-slices), at most about 64 MB even for 1000x1000 Z-slices.
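
The buffer footprint can be estimated with simple arithmetic. The sketch below assumes the NeighborOrientationCorrelation layout described in the commit message (3 slices each of 4-component float32 quaternions and int32 phases, plus 1 slice of float32 confidence index); the function names are hypothetical and other filters buffer less.

```cpp
#include <cassert>
#include <cstddef>

// Bytes of RAM needed to hold `slices` Z-slices of a single array.
constexpr std::size_t sliceBufferBytes(std::size_t dimX, std::size_t dimY, std::size_t components, std::size_t bytesPerComponent, std::size_t slices)
{
  return dimX * dimY * components * bytesPerComponent * slices;
}

// Assumed NeighborOrientationCorrelation buffers:
//   quaternions      : 4 x float32, 3 slices
//   phases           : 1 x int32,   3 slices
//   confidence index : 1 x float32, 1 slice
constexpr std::size_t nocBufferBytes(std::size_t dimX, std::size_t dimY)
{
  return sliceBufferBytes(dimX, dimY, 4, 4, 3) + sliceBufferBytes(dimX, dimY, 1, 4, 3) + sliceBufferBytes(dimX, dimY, 1, 4, 1);
}

static_assert(nocBufferBytes(200, 200) == 2560000, "about 2.6 MB for 200x200 slices");
static_assert(nocBufferBytes(1000, 1000) == 64000000, "about 64 MB for 1000x1000 slices");
```

Under these assumptions the two results sit at the ends of the ~2-64 MB range quoted in the algorithm description, and the total is independent of dimZ.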

Tradeoffs

| Aspect | Original | Optimized (Z-Slice Buffer) |
|---|---|---|
| In-core speed | Baseline | 1.18x-4.42x faster (buffer reads are cache-friendly + bug fixes) |
| OOC speed | 3x-110x slowdown | 1.23x-3.97x speedup over baseline OOC |
| RAM usage | O(1) per-voxel accesses | O(3 Z-slices) buffer |
| Code complexity | Simple neighbor reads | Buffer management + slot rotation |
| Maintenance | Single code path | Single code path (no dispatch) |

Optimization Ceiling Analysis

The filter-level optimizations are at their ceiling. The remaining OOC overhead is dominated by per-element operator[] overhead in ZarrStore. Each getValue()/setValue() call acquires and releases a mutex lock (std::lock_guard<std::mutex>), plus performs a per-element Zarr chunk lookup (converting flat index to N-D chunk position, scanning the 6-slot FIFO cache, indexing into the chunk). Per-element overhead: ~55-75ns vs ~1ns for in-core DataStore (raw pointer access). This is intrinsic to the ZarrStore implementation and cannot be avoided by any filter-level algorithm change.

Infrastructure-level changes that would improve further:

Bulk read/write API on AbstractDataStore: Add getValues(startIndex, count, T* dest) and setValues(startIndex, count, const T* src) virtual methods. DataStore implements with std::memcpy. ZarrStore implements with a single mutex lock around a bulk loop — eliminating millions of redundant mutex lock/unlock cycles in the transfer phases.
Chunk-level bulk transfer in FileCore: A truly bulk Zarr API that reads/writes contiguous ranges directly from/to the chunk's internal buffer via memcpy, bypassing the per-element chunk lookup entirely. Estimated 3-5x additional improvement on transfer phases. Requires changes deep inside the FileCore library's IArray/Block classes.
Larger or per-array FIFO cache: The current 6-slot global FIFO means sequential per-array transfer is required to avoid cache thrashing. Per-array cache isolation or a larger cache would allow parallel array processing again.

These are infrastructure-level changes that would benefit all OOC-optimized filters, not just this group.
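
The proposed bulk API can be sketched as below. This is a hypothetical in-memory stand-in, not ZarrStore or AbstractDataStore: `LockedStore` and `lockCount()` exist only for illustration, while the `getValues`/`setValues` signatures follow the ones suggested above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical store that takes a mutex on every call, to mimic per-element locking cost.
template <typename T>
class LockedStore
{
public:
  explicit LockedStore(std::size_t size)
  : m_Data(size)
  {
  }

  // Per-element access: one lock acquisition per value.
  T getValue(std::size_t index) const
  {
    std::lock_guard<std::mutex> lock(m_Mutex);
    m_LockCount++;
    return m_Data.at(index);
  }

  void setValue(std::size_t index, T value)
  {
    std::lock_guard<std::mutex> lock(m_Mutex);
    m_LockCount++;
    m_Data.at(index) = value;
  }

  // Bulk access: one lock acquisition for the whole contiguous range.
  void getValues(std::size_t startIndex, std::size_t count, T* dest) const
  {
    std::lock_guard<std::mutex> lock(m_Mutex);
    m_LockCount++;
    assert(startIndex + count <= m_Data.size());
    std::copy_n(m_Data.begin() + static_cast<std::ptrdiff_t>(startIndex), count, dest);
  }

  void setValues(std::size_t startIndex, std::size_t count, const T* src)
  {
    std::lock_guard<std::mutex> lock(m_Mutex);
    m_LockCount++;
    assert(startIndex + count <= m_Data.size());
    std::copy_n(src, count, m_Data.begin() + static_cast<std::ptrdiff_t>(startIndex));
  }

  // Number of lock acquisitions so far (illustration only).
  std::size_t lockCount() const
  {
    return m_LockCount;
  }

private:
  std::vector<T> m_Data;
  mutable std::mutex m_Mutex;
  mutable std::size_t m_LockCount = 0;
};
```

Reading a 200x200 slice through `getValue` would take 40,000 lock acquisitions in this model, versus one through `getValues`. A real chunked store would also have to split a range at chunk boundaries, which this sketch ignores.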

Test Plan

In-core unit tests pass for all 5 filters (7/7 tests)
Out-of-core unit tests pass for all 5 filters (7/7 tests, verified via "chunk shape:" output)
Results match exemplar data identically on both configurations
200x200x200 benchmark test cases added for all 5 filters
Post-review-fix benchmarks confirm no regressions (small improvements across the board)

Replace random-access data reads with Z-slice buffering that maintains a rolling window of 3 adjacent Z-slices for quaternion and phase arrays, plus 1 slice for confidence index. This eliminates repeated chunk decompression in OOC mode where arrays are stored as compressed Zarr chunks on disk. Also removed dead neighborDiffCount computation and pre-read all neighbor data before pairwise comparisons.

Filter-only performance (Small_IN100 dataset, 189x201x117 voxels):
- In-core: 3044 ms -> 2360 ms (1.29x speedup)
- Out-of-core: 88166 ms -> 20452 ms (4.31x speedup)

Added Doxygen documentation to all methods in all filter files. Updated unit test with PreferencesSentinel to force OOC behavior.

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>

Add 3-slice rolling buffer for face-neighbor lookups in:
- ErodeDilateBadData: Buffer FeatureIds
- ErodeDilateMask: Buffer mask + maskCopy
- ErodeDilateCoordinationNumber: Buffer FeatureIds (with in-place update)
- ReplaceElementAttributesWithNeighborValues: Buffer input comparison array

All neighbor reads now come from RAM buffers instead of the OOC store, eliminating chunk thrashing during the per-voxel neighbor scan phase.

…lters

Add PreferencesSentinel to force ZarrStore-backed arrays in all existing test cases, and add 200x200x200 benchmark test cases for performance measurement.

Filters updated:
- ErodeDilateBadData (Erode + Dilate)
- ErodeDilateMask (Dilate + Erode)
- ErodeDilateCoordinationNumber
- ReplaceElementAttributesWithNeighborValues

joeykleingers

PR 1554 Code Review

Source: review of commit 69fb0ae. Issues grouped by severity.

Memory / Lifetime Issues

m_BestNeighbor copies entire vector per parallel task
File: NeighborOrientationCorrelation.cpp (NeighborOrientationCorrelationTransferDataImpl)
The member std::vector<int64> m_BestNeighbor stores the bestNeighbor vector by value. Since ParallelTaskAlgorithm creates one functor instance per DataArray, each task copies the full vector. For a 200³ dataset that's ~61 MB per task copy. Consider storing a const std::vector<int64>& instead — the outer scope's bestNeighbor outlives all tasks since parallelTask.execute() blocks until completion.
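
A minimal sketch of the suggested fix, assuming a much simplified functor: `TransferByReference` and `bestNeighborAddress()` are hypothetical, and a std::vector<float> stands in for a DataArray. The point is only that the functor holds a const reference, so constructing one per array does not copy the vector.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified transfer functor: one instance per data array.
class TransferByReference
{
public:
  TransferByReference(std::vector<float>& array, const std::vector<std::int64_t>& bestNeighbor)
  : m_Array(array)
  , m_BestNeighbor(bestNeighbor) // binds a reference; no element is copied
  {
    assert(array.size() == bestNeighbor.size());
  }

  // Copy each flagged voxel's value from its best neighbor.
  void operator()() const
  {
    for(std::size_t i = 0; i < m_BestNeighbor.size(); i++)
    {
      const std::int64_t neighbor = m_BestNeighbor[i];
      if(neighbor >= 0)
      {
        m_Array[i] = m_Array[static_cast<std::size_t>(neighbor)];
      }
    }
  }

  // Exposed only so the example can show that no copy was made.
  const std::vector<std::int64_t>* bestNeighborAddress() const
  {
    return &m_BestNeighbor;
  }

private:
  std::vector<float>& m_Array;
  const std::vector<std::int64_t>& m_BestNeighbor;
};
```

This is safe only while the referenced vector outlives every functor instance, which is the condition the review item relies on when it notes that execute() blocks until completion.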

CPU / Algorithm Efficiency

bestNeighbor vector not cleared between cleanup levels
File: NeighborOrientationCorrelation.cpp (operator())
bestNeighbor is initialized to -1 once but never reset between the currentLevel iterations. A voxel whose CI was raised above the threshold in a prior level retains its stale bestNeighbor entry, causing a redundant (but idempotent) copy in the transfer phase. Adding std::fill(bestNeighbor.begin(), bestNeighbor.end(), -1) at the top of each level would eliminate these redundant copies. (Pre-existing behavior preserved by this PR.)
std::vector<bool> bit-packing overhead in tight inner loop
File: ErodeDilateMask.cpp (operator())
maskSlices and maskCopySlices use std::vector<bool>, which stores bits rather than bytes. Each access requires bit extraction/insertion. For 200×200 per-slice sequential access in a tight 3-deep loop, std::vector<uint8_t> would give direct byte access at the cost of 8× more memory per slice (~5 KB → ~40 KB, still negligible). (Deferred: profile first to see if this is measurable.)
DataArrayCopyTupleFunctor copies DataArray by value
File: ErodeDilateCoordinationNumber.cpp (DataArrayCopyTupleFunctor::operator())
DataArrayType outputArray = dynamic_cast<DataArrayType&>(outputIDataArray); creates a by-value copy of the DataArray. Should be DataArrayType& outputArray. (Pre-existing bug surfaced by this PR.)
computeValidFaceNeighbors called twice per voxel
File: ErodeDilateCoordinationNumber.cpp (operator())
The second call (for the featureCount reset loop) computes the same result as the first call. Remove the second call. (Pre-existing behavior preserved by this PR.)
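
The by-value DataArray copy flagged in this list is easy to reproduce in plain C++. The sketch below uses hypothetical stand-in types (`IArrayBase`, `FloatArray`), not simplnx classes. In this stand-in the copy also loses the write; the review item itself only reports that an unnecessary copy is made.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical polymorphic base and concrete array type.
struct IArrayBase
{
  virtual ~IArrayBase() = default;
};

struct FloatArray : IArrayBase
{
  std::vector<float> values;
};

// Flagged pattern: the cast result initializes a NEW object, copying every value.
// The assignment then modifies the local copy, not the caller's array.
void copyTupleByValue(IArrayBase& base, std::size_t from, std::size_t to)
{
  FloatArray array = dynamic_cast<FloatArray&>(base);
  array.values.at(to) = array.values.at(from);
}

// Suggested pattern: bind a reference, so nothing is copied and the write sticks.
void copyTupleByReference(IArrayBase& base, std::size_t from, std::size_t to)
{
  FloatArray& array = dynamic_cast<FloatArray&>(base);
  array.values.at(to) = array.values.at(from);
}
```

The two functions differ by a single `&`, which is why this kind of copy tends to go unnoticed until profiling or review.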

Naming Consistency

Benchmark constants use kDimX instead of k_DimX
Files: All 5 test files (benchmark TEST_CASEs)
Block-scoped constexpr constants like kDimX, kDimY, kDimZ, kBlockSize, kBlocksPerDim, kTotalVoxels use the k prefix without underscore. The project convention is k_ prefix for constants (e.g., k_DimX).

Const-Correctness

No issues found.

Readability

Comment says "highest similarity count" but code picks last positive
File: NeighborOrientationCorrelation.cpp (operator(), best-neighbor selection loop)
The comment reads "Find the best neighbor (last valid face with highest similarity count)" but the code selects the last face neighbor with any neighborSimCount > 0, not the one with the highest count. The comment should say "last valid face with positive similarity count." (Pre-existing algorithm behavior.)
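
The difference between the two readings can be shown with a small sketch. Both functions are hypothetical reconstructions from the description above, not the filter's code; `simCount` holds one similarity count per face neighbor.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t k_NumFaces = 6;

// Behavior described for the existing code: the LAST face with any positive count wins.
int lastPositiveFace(const std::array<int, k_NumFaces>& simCount)
{
  int best = -1;
  for(std::size_t face = 0; face < k_NumFaces; face++)
  {
    if(simCount[face] > 0)
    {
      best = static_cast<int>(face);
    }
  }
  return best;
}

// What the old comment implied: the face with the HIGHEST count wins (first on ties).
int highestCountFace(const std::array<int, k_NumFaces>& simCount)
{
  int best = -1;
  int bestCount = 0;
  for(std::size_t face = 0; face < k_NumFaces; face++)
  {
    if(simCount[face] > bestCount)
    {
      bestCount = simCount[face];
      best = static_cast<int>(face);
    }
  }
  return best;
}
```

For counts {3, 0, 5, 1, 0, 0} the first function returns face 3 and the second returns face 2, so the comment fix documents the behavior without changing it.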

UX / Human Interface Guidelines

No UI changes in this PR.

Robustness / Defensive

dynamic_cast<IDataArray&> may throw for non-array AttributeMatrix children
File: ReplaceElementAttributesWithNeighborValues.cpp (ExecuteTemplate::operator())
If any child of the AttributeMatrix is not an IDataArray, this throws std::bad_cast. A guard using dynamic_cast<IDataArray*> with a null check + continue would be more defensive. (Pre-existing code, not introduced by this PR.)
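
A sketch of the defensive pattern, using hypothetical stub types (`DataObjectStub`, `DataArrayStub`, `NotAnArrayStub`) rather than the simplnx classes. The pointer form of dynamic_cast returns nullptr on a type mismatch, whereas the reference form throws std::bad_cast.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-ins for an AttributeMatrix's heterogeneous children.
struct DataObjectStub
{
  virtual ~DataObjectStub() = default;
};

struct DataArrayStub : DataObjectStub
{
  std::size_t tupleCount = 0;
};

struct NotAnArrayStub : DataObjectStub
{
};

// Visit only the children that really are arrays; skip everything else.
std::size_t countArrayChildren(const std::vector<std::unique_ptr<DataObjectStub>>& children)
{
  std::size_t count = 0;
  for(const auto& child : children)
  {
    auto* array = dynamic_cast<DataArrayStub*>(child.get());
    if(array == nullptr)
    {
      continue; // not an array: skip instead of throwing
    }
    count++;
  }
  return count;
}
```

Whether silently skipping a non-array child is the right policy depends on the filter; the alternative is to report an error rather than let std::bad_cast escape.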

Bugs

No bugs found.

Documentation

Doxygen comments added to .cpp files — skill requires .hpp only
Files: NeighborOrientationCorrelation.cpp, NeighborOrientationCorrelationFilter.cpp
The doxygen-comments convention states: "Do NOT add comments to .cpp files -- only .hpp headers." This PR adds 4 Doxygen blocks to NeighborOrientationCorrelation.cpp (class NeighborOrientationCorrelationTransferDataImpl, its constructor, its operator(), and the main algorithm operator()()) and 11 Doxygen blocks to NeighborOrientationCorrelationFilter.cpp (one per method: name(), className(), uuid(), humanName(), defaultTags(), parameters(), parametersVersion(), clone(), preflightImpl(), executeImpl(), FromSIMPLJson()). All .cpp-file Doxygen should be removed; the documentation belongs on the declarations in the corresponding .hpp headers.
Single-line Doxygen format on destructor
File: NeighborOrientationCorrelation.hpp (destructor)
/** @brief Default destructor. */ uses single-line format. Convention requires multi-line format for ALL Doxygen comments:
```
/**
 * @brief Default destructor.
 */
```
Inconsistent Doxygen coverage across modified algorithm files
The NeighborOrientationCorrelation .hpp files received comprehensive Doxygen. The other 4 algorithm .hpp files (ErodeDilateBadData.hpp, ErodeDilateCoordinationNumber.hpp, ErodeDilateMask.hpp, ReplaceElementAttributesWithNeighborValues.hpp) were not modified by this PR so this is not blocking — but the .cpp implementations were substantially rewritten. If the intent is to document these algorithms, the Doxygen should go on the .hpp declarations.
Pre-existing wrong @class tag in ErodeDilateBadData.hpp
File: ErodeDilateBadData.hpp
The existing Doxygen reads @class ConditionalSetValueFilter but the class is ErodeDilateBadData. (Pre-existing bug surfaced by this PR.)
No user-facing documentation changes needed — confirmed. Filter behavior (inputs, outputs, results) is unchanged; only internal algorithm performance was optimized.

Confirmed Correct (no action needed)

Z-slice buffer rotation and write-back timing in ErodeDilateMask — slot 0 (z-1) is written back only after z has been fully processed, ensuring erode modifications from current z are captured. Final slice write-back handles both dims[2]==1 and dims[2]>1 correctly.
FeatureIds processed last in ErodeDilateBadData transfer — sequential transfer correctly processes all non-FeatureIds arrays first, then FeatureIds last, because the conditional check depends on original unmodified FeatureIds values.
In-place buffer update after copyTuple in ErodeDilateCoordinationNumber — featureIdSlices[1][inSlice] = featureIds[voxelIndex] correctly re-reads the modified voxel from the backing store into the buffer, ensuring subsequent neighbor lookups see the updated FeatureId.
Sequential per-array transfer pattern — All 5 filters correctly changed from parallel multi-array transfer to sequential per-array transfer, preventing OOC chunk cache thrashing.
currentLevel = currentLevel - 1 double-decrement — The for-loop decrement plus the explicit decrement produces a net decrement of 2 per iteration (6→4→2), matching the original algorithm's cleanup level progression.
Benchmark tests properly reload from .dream3d — All 5 benchmarks write to disk then reload via LoadDataStructure, forcing ZarrStore backing when OOC is enabled. Temp file cleanup via fs::remove() is correct.

Adds AlgorithmDispatch.hpp include and ForceOocAlgorithmGuard(GENERATE(false)) to all Group C morphological filter test cases. This ensures in-core tests only run the in-core path and OOC tests only run the OOC path.

Replace ParallelTaskAlgorithm transfer in ErodeDilateBadData with sequential per-array loop, and reorder ReplaceElementAttributes transfer from per-voxel-per-array to per-array-per-voxel. Processing one array at a time prevents multiple arrays' chunks from evicting each other in the OOC cache. ErodeDilateBadData OOC benchmark improved from 21s to 16s. Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>

Group C filters use a single shared algorithm (Z-slice rolling buffer) for both in-core and OOC modes. DispatchAlgorithm is not used, so ForceOocAlgorithmGuard + GENERATE is unnecessary — the same code path is always tested regardless of build configuration. Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>

Adds a dedicated benchmark test case with programmatic block-based EBSD data (25^3 grain blocks, low CI at boundaries, cubic crystal structure). Benchmark results: IC 2.19s→1.74s (1.26x), OOC 67.94s→18.62s (3.65x). Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>


- Store bestNeighbor as const reference in transfer functor to avoid 61 MB per-task vector copies
- Clear bestNeighbor between cleanup levels to eliminate stale entries
- Fix DataArrayCopyTupleFunctor by-value DataArray copy to use reference
- Remove redundant computeValidFaceNeighbors call in coordination number
- Replace std::vector<bool> with std::vector<uint8> in ErodeDilateMask
- Add null-check guard for dynamic_cast<IDataArray&> in ReplaceElementAttributesWithNeighborValues
- Rename benchmark constants from kDimX to k_DimX convention (all 5 tests)
- Remove Doxygen comments from .cpp files (belongs in .hpp only)
- Fix single-line Doxygen to multi-line format on destructor
- Fix wrong @class tags in algorithm headers
- Fix misleading "highest similarity count" comment

Remove dead RUN_TASK macros and unused faceNeighborInternalIdx variable from NeighborOrientationCorrelation. Use getDataRefAs instead of getDataAs in ErodeDilateBadData. Mark getCancel() const across all four SimplnxCore algorithm classes. Add Doxygen for operator() in algorithm headers. Improve transfer-strategy comments for OOC rationale.

joeykleingers requested a review from imikejackson February 28, 2026 18:56

joeykleingers force-pushed the worktree-neighbor-orientation-correlation-optimization branch 2 times, most recently from b9c5baf to 2bca843 Compare February 28, 2026 19:44

joeykleingers enabled auto-merge (squash) March 2, 2026 06:52

imikejackson added the Out-of-Core label Mar 2, 2026

joeykleingers marked this pull request as draft March 4, 2026 18:38

auto-merge was automatically disabled March 4, 2026 18:38
Pull request was converted to draft

joeykleingers changed the title ~~ENH: Optimize NeighborOrientationCorrelation for out-of-core performance~~ WIP: ENH: Optimize NeighborOrientationCorrelation for out-of-core performance Mar 4, 2026

joeykleingers mentioned this pull request Mar 4, 2026

ENH: Algorithm dispatch for OoOC filter optimizations #1545

Merged

2 tasks

joeykleingers force-pushed the worktree-neighbor-orientation-correlation-optimization branch 4 times, most recently from 27b764c to eb7b727 Compare March 4, 2026 21:18

joeykleingers changed the title ~~WIP: ENH: Optimize NeighborOrientationCorrelation for out-of-core performance~~ ENH: Optimize Group C filters for out-of-core performance Mar 4, 2026

imikejackson changed the title ~~ENH: Optimize Group C filters for out-of-core performance~~ ENH: OoC optimizations for NeighborOrientationCorrelation and Morphological Filters Mar 4, 2026

joeykleingers added 3 commits March 5, 2026 10:23

joeykleingers force-pushed the worktree-neighbor-orientation-correlation-optimization branch from caa307b to 69fb0ae Compare March 5, 2026 15:23

joeykleingers marked this pull request as ready for review March 5, 2026 16:16

joeykleingers commented Mar 5, 2026


joeykleingers added 5 commits March 5, 2026 12:54

joeykleingers force-pushed the worktree-neighbor-orientation-correlation-optimization branch 2 times, most recently from 8db9373 to 3d2b969 Compare March 5, 2026 19:05

STYLE: Apply clang-format to Group C algorithm files

3a20273

joeykleingers force-pushed the worktree-neighbor-orientation-correlation-optimization branch from 3d2b969 to 3a20273 Compare March 5, 2026 19:16

joeykleingers wants to merge 10 commits into BlueQuartzSoftware:develop from joeykleingers:worktree-neighbor-orientation-correlation-optimization
